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Preface 


he 1990s have witnessed a continuation and expansion of the upsurge of in- 
| terest in psychological testing that was evident in the 1980s. This growth is 
indicated by the development of new tests—some representing novel ap- 
proaches—as well as by the revisions of earlier tests and continued research 
on existing tests. In selecting instruments for mention or discussion, a primary 
goal was to expose readers to the rich variety of measurement tools available in 
the field today, as well as to some historically significant tests and techniques; any 
attempt to cover the field—or even a significant segment of it—in a comprehen- 
sive fashion would have been clearly beyond the scope of this book. 

Increasing attention is being given to the individual who is tested. The test 
user is encouraged to search for the causes of the individual's performance in that 
person's experiential history and reactional biography: What about the individ- 
ual’s background may help to understand that person's test responses and enhance 
the value of scores in predicting subsequent behavior—in school, at work, and in 
other contexts of everyday-life activities? It follows that the test user has increased 
responsibility in selecting appropriate tests and methods of presenting testing to 
the individual; in interpreting scores; and in communicating and using test re- 
sults. Because of such considerations, this textbook is designed especially with the 
goal of providing a foundation for proper test use. 

Effective test use requires some basic familiarity with test construction. Such infor- 
mation is needed in order to evaluate different tests, choose tests appropriate for par- 
ticular purposes and individual examinees, and to interpret scores properly. Although 
this book is not addressed specifically to the test constructor, it nevertheless includes 


12 PREFACE 


enough information about how tests are developed to meet the needs of the test user. 

This edition also provides simple explanations of some widely used and rapidly 
developing concepts and procedures, which are likely to affect psychometric prac- 
tices in the 21st century. Examples include computerized adaptive testing, meta- 
analysis, structural equation modeling, the use of confidence intervals in place of 
the traditional statistical significance, cross-cultural testing, and the increasing use 
of factor analysis in the development of ability and personality tests. A practical 
application of the factor-analytic approach has been to provide norms that permit 
interpretation of scores at different levels of specificity or generality, so that a test 
user can apply the level that is most suitable for the particular person or situation. 

Two significant long-term trends are abundantly evident in testing today; in- 
stead of devoting separate chapters to these, we chose to discuss them wherever 
they were pertinent, throughout the entire book. The first trend is the constantly 
increasing influence of computerization in test development, test construction, 
and test administration, in addition to the well-established uses of computers in 
test scoring and score processing. The velocity of technological advances is so 
startling that it seems to outpace progress in substantive areas of psychology. 
Technology, however, also contributes mightily to progress in both the research 
and theoretical fronts. For instance, a great deal of integration and cross-fertiliza- 
tion of areas within the psychological literature is rapidly taking place, fostered by 
the ease with which information can be accessed, processed, and communicated 
among investigators around the world. The recasting of cognitive and personality 
traits as interacting and inseparable aspects of the individual—who is, in turn, in- 
extricably tied to her or his physical self, life events, and environment—is one of 
the most salient, and promising, instances of this tendency toward integration. 

A second trend that affects psychological testing in very significant ways is the 
growing encroachment of political and legalistic concerns in the area. While this is 
a divisive and potentially pernicious influence on the field, it may also have some 
positive ramifications in terms of fostering creativity and increased vigilance with 
respect to the intended and unintended consequences of the use of tests. Citations 
of a number of public laws (P.L.) that have had an impact on testing practices are 
given throughout the text—along with their titles and the years they were enacted; 
their content can be found in the United States Congressional Record as well as other 
periodical publications available in the reference departments of most libraries. 

Although the first six editions of this text had a single author, this edition is a 
true co-authorship. The two authors jointly planned the reorganization of chap- 
ters and the major topics to be covered. The specific revision and re-writing of 
chapters was allocated as follows: Anastasi was responsible for chapters 1-7 and 
10-12; Urbina for chapters 8, 9, and 13-18; Urbina also undertook major admin- 
istrative and correspondence functions. Nevertheless, each author read a draft of 
the others chapters and suggested improvements, which were usually imple- 
mented. It also should be noted that the goal of balancing the order of precedence 
for the feminine and the masculine genders—in pronouns and other words—was 
achieved by using the alphabet as a guide. Thus, "her" precedes "his," "he" pre- 
cedes “she,” “female” precedes “male,” “man” precedes “woman,” and so on. 


Preface 13 


Obviously this book could not have been written without access to the re- 
search and publications of many psychologists, both present and past, from differ- 
ent parts of the United States and other countries. Their names appear 
throughout the text: in the citations of their publications, in the acknowledge- 
ments of the sources of figures and tables, and in the composite list of references at 
the end of the book. Within this impressive group, a few individuals nevertheless 
stand out because of their ever-ready cooperation and the extent of their contri- 
butions. Conspicuous among them are Dianne Brown from the Science Direc- 
torate of the American Psychological Association, Aurelio Prifitera and Joanne 
Lenke of the Psychological Corporation, Lorin Letendre of Consulting Psycholo- 
gists Press, Carol Watson of National Computer Systems, Douglas Jackson of 
Sigma Assessment Systems, Elizabeth McGrath and John Oswald of the Riverside 
Publishing Company, and Wayne Camara of the College Board. In addition, the 
assistance of staff members of the Fordham University and University of North 
Florida libraries, in meeting the varied challenges involved in the preparation of 


this book, is gratefully acknowledged. 
A.A. 
S.U. 
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Nature and Use 
of Psychological Tests 


sychological tests are tools. To reap the benefits that tests can provide, one 
must keep this essential fact in mind. Any tool can be an instrument of 
good or harm, depending on how it is used. Testing has been growing at an 
increasing pace, and it is contributing effectively in more and more areas of 
daily life." But this growth has been accompanied by some unrealistic expecta- 
tions and misuses. Test users need to know how to evaluate tests. How good is this 
test for the particular purpose for which it is being used? What kind of informa- 
tion can it provide about the person who takes it? How can its results be inte- 
grated into the network of data leading to action decisions? It is primarily with. 
such questions in view that this book has been prepared. The book is not de- 
signed for the test specialist but rather for the general student of psychology. 
Today, some basic knowledge about tests is needed —not only by those who con- 
struct or give tests but also by anyone who uses test results as one source of data in 
reaching decisions about oneself or others. 


USES AND VARIETIES OF PSYCHOLOGICAL TESTS 


Traditionally, the function of psychological tests has been to measure differences 
between individuals or between the reactions of the same individual under differ- 
ent circumstances. One of the earliest problems that stimulated the development 
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of psychological tests was the identification of mentally retarded persons. To this 
day, the detection of intellectual deficiencies remains an important application of 
certain types of psychological tests. Related clinical uses of tests include the ex- 
amination of persons with severe emotional disorders and other types of behav- 
ioral problems. A strong impetus to the early development of tests was likewise 
provided by assessment needs arising in education. This was the case in the fa- 
mous Binet tests that ushered in intelligence testing. At present, schools are 
among the largest test users. The classification of children with reference to their 
ability to profit from different types of school. instruction, the identification of 
outstandingly slow or fast learners, the educational and occupational counseling 
of high school and college students, and the selection of applicants for profes- 
sional schools are among the many educational uses of tests. 

The selection and classification of industrial personnel represent another 
major application of psychological testing. From the assembly-line operator or fil- 
ing clerk to top management, there is scarcely a type of job for which some kind 
of psychological test has not proved helpful in such matters as hiring, job assign- 
ment, transfer, promotion, or termination. To be sure, the effective employment 
of tests in many of these situations, especially in connection with high-level jobs, 
usually requires that the tests be used as an adjunct to skillful interviewing, so that 
test scores may be properly interpreted in the light of other background informa- 
tion about the individual. Nevertheless, testing constitutes an important part of 
the total personnel program. A closely related application of psychological testing 
is to be found in the selection and classification of military personnel. From sim- 
ple beginnings in World War L the scope and variety of psychological tests em- 
ployed in military situations underwent a phenomenal increase during World War 
II. Subsequently, research on. test development has been continuing on a large 
scale in all branches of the armed services. 

The use of tests in individual counseling has gradually broadened from a 
narrowly defined guidance regarding educational and vocational plans to an 
involvement with all aspects of the person's life. Emotional well-being and effec- 
tive interpersonal relations have become increasingly prominent objectives of 
counseling. There is growing emphasis, too, on the use of tests to enhance self- 
understanding and personal development. Within this framework, test scores are 
part of the information given to the individual as aids to her or his own decision- 
making processes. 

It is clearly evident that psychological tests are currently being employed in the 
solution of a wide range of practical problems. One should not, however, lose sight 
of the fact that such tests are also serving important functions in basic research. 
Nearly all problems in differential psychology, for example, require testing proce- 
dures as a means of gathering data. As illustrations, reference may be made to 
studies on the nature and extent of individual differences, the organization of psy- 
chological traits, the measurement of group differences, and the identification of 
biological and cultural factors associated with behavioral differences. For all such 
areas of research—and for many others—the precise measurement of individual 
differences made possible by well-constructed tests is an essential prerequisite. 
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Similarly, psychological tests provide standardized tools for investigating such 
varied problems as life-span developmental changes within the individual, the 
relative effectiveness of different educational procedures, the outcomes of psy- 
chotherapy, the impact of community programs, and the influence of environ- 
mental variables on human performance. 

The many kinds of tests designed for these diverse purposes differ also in other 
major characteristics. They vary in the way they are administered, as in the indi- 
vidual testing of each person by a trained examiner, the simultaneous testing of 
large groups, or the administration of tests by computers. Tests also differ in the 
aspects of behavior that they cover. Some concentrate on the assessment of cog- 
nitive traits, or abilities. These may range from broad aptitudes—such as readi- 
ness to profit from college work—to highly specific sensorimotor skills required to 
perform a simple manual operation. Other tests provide measures’ of affective 
variables, or personality, such as emotional or motivational traits, interpersonal 
behavior, interests, attitudes, and values. 

In the face of such diversity in nature and purpose, what are the common 
differentiating characteristics of psychological tests? How do psychological 
tests differ from other methods of gathering information about individuals? 
The answer is to be found in certain fundamental features of both the con- 
struction and the use of tests. It is with these features that the present chapter 
is concerned. 


WHAT IS A PSYCHOLOGICAL TEST? 


Behavior Sample. A psychological test is essentially an objective and 
standardized measure of a sample of behavior. Psychological tests are like the tests 
in any other science, insofar as observations are made op a small but carefully 
chosen sample of an individual’s behavior. In this respect, the psychologist 
proceeds in much the same way as the biochemist who tests a patient’s blood or a 
communitys water supply by analyzing one or more samples of it. If the 
psychologist wishes to test the extent of a child’s vocabulary, a clerk’s ability to 
perform arithmetic computations, or a pilot’s eye-hand coordination, he or she 
examines their performance with a representative set of words, arithmetic 
problems, or motor tests. Whether or not the test adequately covers the behavior 
under consideration obviously depends on the number and the nature of items in 
the sample. For example, an arithmetic test consisting of only five problems, or 
one including only multiplication items, would be a poor measure of the 
individual’s computational skill: A vocabulary test composed entirely of baseball 
terms would hardly provide a dependable estimate of a child’s total range of 
vocabulary. 

The diagnostic or predictive value of a psychological test depends on the degree 
to which it serves as an indicator of a relatively broad and significant area of be- 
havior. Measurement of the behavior sample directly covered by the test is rarely, 
if ever, the goal of psychological testing. The child’s knowledge of a particular list 
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of 50 words is not, in itself, of great interest. Nor is the job applicant's perfor- 
mance on a specific set of 20 arithmetic problems of much importance. If, how- 
ever, it can be demonstrated that there is a close correspondence between the 
child’s knowledge of the word list and his total mastery of vocabulary, or between 
the applicant’s score on the arithmetic problems and her computational perfor- 
mance on the job, then the tests are serving their purpose. 

Tt should be noted in this connection that the test items need not resemble 
closely the behavior the test is to predict. It is necessary only that an empirical 
correspondence be demonstrated between the two. The degree of similarity be- 
tween the test sample and the predicted behavior may vary widely, At one ex- 
treme, the test may coincide completely with a part of the behavior to be 
predicted. An example might be a foreign-vocabulary test in which the students 
are examined on 20 of the 50 new words they have studied; another example is 
provided by the road test taken prior to one's obtaining a driver's license. A lesser 
degree of similarity is illustrated by many vocational aptitude tests administered 
prior to job training, in which there is only a moderate resemblance between the 
tasks performed on the job and those incorporated in the test. At the other ex- 
treme one finds projective personality tests such as the Rorschach inkblot test, in 
which an attempt is made to, predict from the respondent's associations to 
inkblots how he or she will react to other people, to emotionally toned stimuli, 
and to other complex, everyday-life situations. Despite their superficial differ- 
ences, all these tests consist of samples of the individual’s behavior. And each 
must prove its worth by an empirically demonstrated correspondence between the 
examinee’s performance on the test and in other situations. 

Whether the term “diagnosis” or the term “prediction” is employed in this con- 
nection also represents a minor distinction. Prediction commonly connotes a 
temporal estimate—for example, individuals’ future performance on a job being 
forecast from their present test performance. In a broader sense, however, even 
the diagnosis of present condition, such as mental retardation or emotional disor- 
der, implies a prediction of what the individual will do in situations other than 
the present test. It is logically simpler to regard all tests as behavior samples from 
which predictions regarding other behavior can be made. Different types of tests 
can then be characterized as variants of this basic pattern. 

Another point that should be considered at the outset pertains to the concept 
of capacity. It is entirely possible, for example to devise a test for predicting how 
well an individual can learn French before he or she has even begun the study of 
French. Such a test would involve a sample of the types of behavior required to 
learn the new language, but would in itself presuppose no knowledge of French. It 
could then be said that this test measures the individual's "capacity" or “potential- 
ity" for learning French. Such terms should, however, be used with caution in ref- 
erence to psychological tests. Only in the sense that a present behavior sample 
can be used as an indicator of other, future behavior can we speak of a test mea- 
suring "capacity." No psychological test can do more than measure behavior. 
Whether such behavior can serve as an effective index of other behavior can be 
established only by empirical tryout. 
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Standardization. lt will be recalled that in the initial definition a psychologi- 
cal test was described as a standardized measure. Standardization implies unifor- 
mity of procedure in administering and scoring the test. If the scores obtained by 
different persons are to be comparable, testing conditions must obviously be the 
same for all. Such a requirement is only a special application of the need for 
controlled conditions in all scientific observations. In a test situation, the single 
independent variable is often the individual being tested. 

In order to secure uniformity of testing conditions, the test constructor pro- 
vides detailed directions for administering each newly developed test. The formu- 
lation of directions is a major part of the standardization of a new test. Such 
standardization extends to the exact materials employed, time limits, oral instruc- 
tions, preliminary demonstrations, ways of handling queries from test takers, and 
every other detail of the testing situation. Many other, more subtle factors may in- 
fluence performance on certain tests. The, in giving instructions or presenting 
problems orally, consideration must be given to the rate of speaking, tone of 
voice, inflection, pauses, and facial expression. In a test involving the detection of 
absurdities, for example, the correct answer may be given away by smiling or paus- 
ing when the crucial word is read. Standardized testing procedure, from the exam- 
iners point of view, will be discussed further in a later section of this chapter 
dealing with problems of test administration. 

Another important step in the standardization of a test is the establishment of 

"norms. Psychological tests have no predetermined standards of passing or failing; 
performance on each test is evaluated on the basis of empirical data. For most pur- 
poses, an individual's test score is interpreted by comparing it with the scores ob- 
tained by others on the same test. As its name implies, a norm is the normal or 
average performance. Thus, if normal 8-year-old children complete 12 out of 50 
problems correctly on a particular arithmetic reasoning test, then the 8-year-old 
norm on this test corresponds to a score of 12. The latter is known as the raw 
score on the test. It may be expressed as number of correct items, time required to 
complete a task, number of errors, or some other objective measure appropriate to 
the content of the test. Such a raw score is meaningless until evaluated in terms of 
suitable interpretive data. 

In the process of standardizing a test, it is administered to a large, representa- 
tive sample of the type of persons for whom it is designed. This group, known as 
the standardization.sample, serves to establish the norms. Such norms indicate 
not only the average performance but also the relative frequency of varying de- 
grees of deviation above and below the average. It is thus possible to evaluate dif- 
ferent degrees of superiority and inferiority. The specific ways in which such 
norms may be expressed will be considered in chapter 3. All permit the designa- 
tion of the individual's position with reference to the normative or standardiza- 
tion sample. 

It might also be noted that norms are established for personality tests in 
essentially the same way as for aptitude tests. The norm on a personality test is 
not necessarily the most desirable or "ideal" performance, any more than a 
perfect or errorless score is the norm on an aptitude test. On both types of tests, 
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the norm corresponds to the performance of typical or average persons. On 
dominance-submission tests, for example, the norm falls at an intermediate point 
representing the degree of dominance or submission manifested by the average 
person. Similarly, in an emotional adjustment inventory, the norm does not ordi- 
narily correspond to a complete absence of unfavorable or maladaptive responses. 
A few such responses occur in the majority of “normal” individuals in the stan- 
dardization sample, and this number of maladaptive responses would thus repre- 
sent the norm. 


Objective Measurement of Difficulty. Reference to the definition of a psy- 
chological test with which this discussion opened will show that such a test was 
characterized as an objective as well as a standardized measure. In what specific 
ways are such tests objective? Some aspects of the objectivity of psychological 
tests have already been touched on in the discussion of standardization. Thus, the 
administration, scoring, and interpretation of scores are objective insofar as they 
are independent of the subjective judgment of the particular examiner. Any one 
test taker should theoretically obtain the identical score on a test regardless of 
who happens to be the examiner. This is not entirely so, of course, because perfect 
standardization and objectivity have not been attained in practice. But at least 
such objectivity is the goal of test construction and has been achieved to a 
reasonably high degree in most tests. 

There are other major ways in which psychological tests can be properly de- 
scribed as objective. The determination of the difficulty level of an item or of a 
whole test is based on objective, empirical procedures. When Binet and Simon 
prepared their original, 1905 scale for the measurement of intelligence, they 
arranged the 30 items of the scale in order of increasing difficulty. Such difficulty 
was determined by trying out the items on 50 normal and a few mentally retarded 
children. The items correctly solved by the largest number of children were, ipso 
facto, taken to be the easiest; those passed by relatively few children were re- 
garded as more difficult items. By this procedure, an empirical order of difficulty 
was established. This early example typifies the objective measurement of diffi- 
culty level, which is now common practice in psychological test construction. 

Not only the arrangement but also the selection of items for inclusion in a test 
can be guided by the proportion of persons in the trial samples who pass each 
item. Thus, if there is a bunching of items at the easy or difficult end of the scale, 
some items will be discarded. Similarly, if items are sparse in certain portions of 
the difficulty range, new items can be added to fill the gaps. More technical as- 
pects of item analysis are considered in chapter 7. 


Reliability. How good is this test? Does it really work? These questions 
could—and occasionally do—result in long hours of futile discussion. Subjective 
opinions, hunches, and personal biases may lead, on the one hand, to extravagant 
claims regarding what a particular test can accomplish and, on the other hand, to 
stubborn rejection. The only way questions such as these can be conclusively 
answered is by empirical trial. The objective evaluation of psychological tests, 
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involves primarily the determination of the reliability and the validity of the test 
in specified situations. 

As used in psychometrics, the term "reliability" basically means consistency. 
Test reliability is the consistency of scores obtained by the same persons when 
retested with the identical test or with an equivalent form of the test. If a child re- 
ceives an IQ of 110 on Monday and an IQ of 80 when retested on Friday, it is ob- 
vious that little or no confidence can be put in either score. Similarly, if in one set 
of 50 words an examinee identifies 40 correctly, whereas in another, supposedly 
equivalent set she gets a score of only 20 right, then neither score can be taken as 
a dependable index of her verbal comprehension. To be sure, in both illustrations 
it is possible that only one of the two scores is in error, but this could be demon- 
strated only by further retests. From the given data, we can conclude only that 
both scores cannot be right. Whether one or neither is an adequate estimate of 
the individual’s ability in vocabulary cannot be established without additional in- 
formation. 

Before a psychological test is released for general use, a thorough, objective 
check of its reliability should be carried out. The different types of test reliability, 
as well as methods of measuring each, will be considered in chapter 4. Reliability 
may be checked by comparing the scores obtained by the same test takers at differ- 
ent times, with different sets of items, with different examiners or scorers, or under 
any other relevant testing condition. It is essential to specify the type of reliability 
and the method employed to determine it, because the same test may vary in these 
different aspects. The number and nature of persons on whom reliability. was 
checked should likewise be reported. With such information, test users can predict. 
whether the test will be about equally reliable for the group with which they ex- 
pect to use it, or whether it is likely to be more reliable or less reliable. 


Validity. Undoubtedly the most important question to be asked about any 
psychological test concerns its validity—that is, the degree to which the test 
actually measures what it purports to measure. Validity provides a direct check on 
how well the test fulfills its function. The determination of validity usually 
requires independent, external criteria of whatever the test is designed to measure. 
For example, if a medical aptitude test is to be used in selecting promising 
applicants for medical school, ultimate success. in medical school would be a 
criterion. In the process of validating such a test, it would be administered to a 
large group of students at the time of their admission to medical school. Some 
measure of performance in medical school would eventually be obtained for each 
student on the basis of grades, ratings by instructors, success or failure in 
completing training, and the like. Such a composite measure constitutes the 
criterion with which each student’s initial test score is to be correlated. A high 
correlation, or validity coefficient, would signify that those individuals who scored 
high on the test had been relatively successful in medical school, whereas: those 
scoring low on the test had done poorly in medical scheol. A low correlation 
would indicate little correspondence between test score and criterion measure 
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and hence poor validity for the test. The validity coefficient enables us to 
determine how closely the criterion performance could have been predicted from 
the test scores. 

In a similar manner, tests designed for other purposes can be validated against 
appropriate criteria. À vocational aptitude test, for example, can be validated 
against on-the-job success of a trial group of new employees. A pilot aptitude bat- 
tery can be validated against achievement in flight training. Tests designed for 
broader and more varied uses are validated against a number of independently ob- 
tained behavioral indices; and their validity can be established only by the grad- 
ual accumulation of data from many different kinds of investigations. 

The reader may have noticed an apparent paradox in the concept of test valid- 
ity. If it is necessary to follow up the test takers or in other ways to obtain inde- 
pendent measures of what the test is trying to predict, why not dispense with the 
test? The answer to this riddle is to be found in the distinction between the vali- 
dation group on the one hand and the groups on which the test will eventually be 
employed for operational purposes on the other. Before the test.is ready for use; its 
validity must be established on.a representative sample of persons. The scores of 
these persons are not themselves employed for operational purposes but serve 
only in the process of testing the test. If the test proves valid by this method, it 
can then be used on other samples in the absence of criterion measures. 

It might still be argued that we would need only to wait for the criterion mea- 
sure to mature, to become available, on any group in order to obtain the informa- 
tion that the test is trying to predict. But such a procedure would be so wasteful 
of time and energy as to be prohibitive in most instances. Thus, we could deter- 
mine which applicants will succeed on a job or which students will satisfactorily 
complete college by admitting all who apply (or a random sample of them) and 
waiting for subsequent developments! It is the very wastefulness of this proce- 
dure—and its deleterious emotional impact on individuals—that tests are de- 
signed to minimize. By means of tests, the person's present level of prerequisite 
skills, knowledge, and other relevant characteristics can-be assessed with a deter- 
minable margin of error. The more valid and reliable the test, the smaller will be 
this margin of error. 

The special problems encountered in determining the validity of different 
types of tests, as well as the specific criteria and statistical procedures employed, 
are discussed in chapters 5 and 6. One further point, however, should be consid- 
ered at this time. Validity tells us more than the degree to which the test is fulfill- 
ing its function. It actually tells us what the test is measuring. By studying the 
validation data, we can objectively determine what the test is measuring. It would 
thus be more accurate to define validity as the extent to which we know what the 
test measures. The interpretation of test scores would undoubtedly be clearer and 
less ambiguous if tests were regularly named in terms of the empirically estab- 
lished relationships through: which they had been validated. A tendency in this 
direction can be recognized in such test labels as “scholastic assessment test” and 
“personnel classification test” in place of the vague title “intelligence test.” 
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WHY CONTROL THE USE OF PSYCHOLOGICAL TESTS? 


"May I have a Stanford-Binet blank? My nephew has to take it next week for 
admission to School X and I'd like to give him some practice so he can 
pass." 

"To improve the reading program in our school, we need a culture-free IQ test 
that measures each child's innate potential." 

"Last night I answered the questions in an intelligence test published in a mag- 
azine and I got an IQ of 80—I think psychological tests are silly.” 

“My roommate is studying psych. She gave me a personality test and I came out 
neurotic. I've been too upset to go to class ever since.” 

"Last year you gave a new personality test to our employees for research pur- 
poses. We would like to have the scores for their personnel folders." 


Qualified Examiner. The need fora qualified examiner is evident in each of 
the three major aspects of the testing situation: selection of the test, admin- 
istration and scoring, and interpretation of scores. Tests cannot be chosen like 
lawn mowers, from a mail-order catalog. They cannot be evaluated by name, 
author, or other easy marks of identification. To be sure, it requires no psychologi- 
cal training to consider such factors as cost, bulkiness and ease of transporting test 
materials, testing time required, and ease and rapidity of scoring. Information on 
these practical points can usually be obtained from a test catalog and should be 
taken into account in planning a testing program. For the test to serve its 
function, however, an evaluation of its technical merits in terms of such 
characteristics as validity, reliability, difficulty level, and norms is essential. Only 
in such a way can test users determine the appropriateness of any test for their 
particular purpose and its suitability for the type of persons with whom they plan 
to use it. 

The introductory discussion of test standardization earlier in this chapter has 
already suggested the importance of a trained examiner. An adequate realization 
of the need to follow instructions Precisely, as well as a thorough familiarity with 
the standard instructions, is required if the test scores obtained by different exam- 
iners are to be comparable or if any one individual's score is to be evaluated in 
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terms of the published norms. Careful control of testing conditions is also essen- 
tial. Similarly, incorrect or inaccurate scoring may render the test score worthless, 
In the absence of proper checking procedures, scoring errors are far more likely to 
occur than is generally realized, 

The proper interpretation of test scores requires a thorough understanding of 
the test, the test taker, and the testing conditions. What is being measured can be 
objectively determined only by reference to the specific procedures through 
which the particular test was validated. Other information—pertaining to relia- 
bility, nature of the group on which norms were established, and the like—is like- 
wise relevant. Some background data regarding the individual being tested are 
essential in interpreting any test score. The same score may be obtained by differ- 
ent persons for very different reasons. The conclusions to be drawn from such 
scores would therefore be quite dissimilar, Finally, some consideration must also 
be given to special factors that may have influenced a particular score, such as un- 
usual testing conditions, the temporary emotional or physical state of the test 
taker, and the extent of the test taker's previous experience with tests. 


Role of the Test User. A significant development in psychological testing in 
the 1980s and 1990s was the increasing recognition of the key role of the test user 
(Anastasi, 1990b). In this context, the test user is anyone who uses test scores as 
one source of information in reaching practical decisions. The test user may or 
may not be the examiner who administers and scores the test. Examples of test 
users are teachers, counselors, administrators in school systems, and personnel 
workers in industry or government. Most criticisms of tests are directed, not to 
intrinsic features of the tests, but to misuses of test results by inadequately 
qualified users. Some misuses stem from a desire for shortcuts, quick answers, and 
simple routine solutions for real-life problems. Time pressure from work overload 
may encourage reliance on such expedients. Probably the most frequent cause of 
test misuse, however, is insufficient or faulty knowledge about testing (Eyde, 
Moreland, Robertson, Primoff, & Most, 1988; Moreland, Eyde, Robertson, 
Primoff, & Most, 1995; Tyler & Miller, 1986). 

Special committees of national professional organizations, working jointly 
with test publishers, have been giving increasing attention to the prevention of 
test misuse. A notable example is the project conducted by the Test User Qualifi- 
cations Working Group, familiarly known by the delightful acronym TUQWoG 
(Eyde et al., 1988). The chief goal of TUQWOG was to develop an empirical, 
data-based set of essential qualifications for users of different types of tests that 
test publishers. could incorporate into their test purchaser qualification forms. 
Through five years of intensive, nationwide research, the TUQWAOG project de- 
veloped an impressive database. Some publishers have already begun to use the 
results in their purchaser qualification forms. A second working group was subse- 
quently formed whose purpose was to use the available TUQW0G database to de- 
velop guidelines and training materials for test users. Known by the modified 
acronym of TUTWoG (With a T for training), it had as its first product a case- 
book of common test misuses to guard against (Eyde et al., 1993). The cases are 
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based on specific instances of misuses actually observed in a variety of settings 
that were submitted in the project's wide-ranging survey. A later summary is given 
by Moreland et al. (1995). 


Security of Test Content and Communication of Test Information. 
Obviously, if one were to memorize the correct responses on a test of color 
blindness, such a test would no longer be a measure of color vision for that person. 
Under these conditions, the test would be completely invalidated. Test content 
clearly has to be restricted in order to forestall deliberate efforts to fake scores. In 
other cases, however, the effect of familiarity may be less obvious, or the test may 
be invalidated in good faith by misinformed persons. A schoolteacher, for 
example, may give a class special practice in problems closely resembling those on 
an intelligence test, "so that the children will be well prepared to take the test." 
Such an attitude is simply a carry-over from the usual procedure of preparing for a 
school examination. When applied to an intelligence test, however, it is likely 
that such specific training or coaching will raise the scores on the test without 
appreciably affecting the broader area of behavior the test tries to sample. Under 
such conditions, the validity of the test as a predictive or diagnostic instrument is 
reduced. 

Ensuring the security of specific test content need not-—and should not— 
interfere with the effective communication of testing information to test takers, 
concerned professionals, and the general public. Such communication serves sev- 
eral purposes. First, it tends to dispel any mystery that may have become associ- 
ated with testing and thereby helps to correct prevalent misconceptions about 
what tests are designed to do and what their scores mean. A number of clearly 
written publications distributed by some of the major test publishers are designed 
for this purpose. A second type of communication is concerned with the techni- 
cal procedures whereby particular tests were constructed and evaluated; they 
present the relevant data about reliability, validity, and other psychometric prop- 
erties of the test. This type of information is typically included in the technical 
manual prepared for each test and is available to any interested person. 

A third purpose of test communication is to familiarize test takers with testing 
procedures, dispel anxiety, and ensure that each will perform to the best of her or 
his ability. Several explanatory booklets have been prepared for this purpose, 
some of a general nature and others for specific tests such as the College Board's 
Scholastic Assessment Test. These materials will be discussed further in a later 
section of this chapter. A fourth and highly significant type of communication is 
the feedback provided to test takers regarding their own performance on any test 
they have taken. Psychologists have given considerable attention to the most use- 
ful and meaningful ways of conveying such information in different contexts. Ap- 
propriate procedures are examined in chapters 17 and 18. 

The dissemination of information about testing is of fundamental importance. 
There are helpful as well as harmful ways of carrying it out. An example of the lat- 
ter is provided by some overhasty legislative attempts to introduce government 
controls at both state and federal levels (Bersoff, 1981, 1983; B. Lerner, 1980b). 
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State laws regulating the disclosure of testing information were actually enacted 
in the late 1970s in California and New York. The New York law, which was the 
more extreme of the two, required the unlimited disclosure of test questions and 
answers in all large-scale testing programs for admission to institutions of higher 
learning. 

Because such a disclosure requirement necessitates the preparation of a new 
form of each test for each administration, it can have any of several adverse ef- 
fects. These include, among others, fewer available testing dates during the year, 
increase in applicant fees, and decline in the quality-control procedures that can 
be followed in test construction and in equating scores from tests given at differ- 
ent times. It is also noteworthy that very few test takers avail themselves of the 
test disclosure opportunity, and that retest performance on another form is not 
significantly improved by such disclosure (Stricker, 1984). The desirable goals 
that motivated the proposal of test disclosure laws can be attained more effec- 
tively and without deleterious side effects by strengthening available procedures 
for communicating information about tests. 


TEST ADMINISTRATION 


The basic rationale of testing involves generalization from the behavior sample 
observed in the testing situation to behavior manifested in other, nontest situa- 
tions. A test score should help us to predict how the client will feel and act out- 
side the clinic, how the student will achieve in college courses, and how the 
applicant will perform on the job. Any influences that are specific to the test situ- 
ation constitute error variance and reduce test validity. It is therefore important 
to identify any test-related influences that may limit or impair the generalizability 
of test results. 

A whole volume could easily be devoted to a discussion of desirable procedures 
of test administration. But such a survey falls outside the scope of this book. 
Moreover, it is more practicable to acquire such techniques within specific set- 
tings, because no one person would normally be concerned with all forms of test- 
ing, from the examination of infants to the clinical testing of psychotic patients 
or the administration of a mass testing program for military personnel. The pre- 
sent discussion will therefore deal principally with the common rationale of test 
administration rather than with specific questions of implementation. An excel- 
lent example of implementation can be found in the comprehensive discussion of 
the individual assessment of children given by Sattler (1988, chap. 5). 


Advance Preparation of Examiners. The most important single require- 
ment for good testing procedure is advance preparation. In testing there can be 
no emergencies. Special efforts must therefore be made to foresee and forestall 
emergencies. Only in this way can uniformity of procedure be assured. 

Advance preparation for the testing session takes many forms. Memorizing the 
exact verbal instructions is essential in most individual testing. Even in a group 
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test in which the instructions are read to the test takers, some previous familiarity 
with the statements to be read prevents misreading and hesitation and permits a 
more natural, informal manner during test administration. The preparation of test 
materials is another important preliminary step. In individual testing and espe- 
cially in the administration of performance tests, such preparation involves the 
actual layout of the necessary materials to facilitate subsequent use with a mini- 
mum of search or fumbling. Materials should generally be placed on a table near 
the testing table so that they are within easy reach of the examiner but do not dis- 
tract the test taker. When complex apparatus is employed, frequent periodic 
checking and calibration may be necessary. In group testing, all test blanks, an- 
swer sheets, special pencils, or other materials needed should be carefully 
counted, checked, and arranged in advance of the testing day. 

Thorough familiarity with the specific testing procedure is another important 
prerequisite in both individual and group testing. For individual testing, super- 
vised training in the administration of the particular test is usually essential. De- 
pending upon the nature of the test and the type of persons to be examined, such 
training may require from a few demonstration and practice sessions to more than 
a year of instruction. For group testing, and especially in large-scale projects, the 
preparation may include advance briefing of examiners and proctors, so that each 
is fully informed about the functions he or she is to perform. In general, the exam- 
iner reads the instructions, takes care of timing, and is in charge of the group in 
any one testing room. The proctors hand out and collect test materials, make cer- 
tain that instructions are followed, answer the individual questions of test takers 
within the limitations specified in the manual, and prevent cheating. 


Testing Conditions. Standardized procedure applies not only to verbal in- 
structions, timing, materials, and other aspects of the tests themselves but also to 
the testing environment. Some attention should be given to the selection of a 
suitable testing room. This room should be free from undue noise and distraction 
and should provide adequate lighting, ventilation, seating facilities, and working 
space for test takers. Special steps should also be taken to prevent interruptions 
during the test. Posting a sign on the door to indicate that testing is in progress is 
effective, provided all personnel have learned that such a sign means no 
admittance under any circumstances. In the testing of large groups, locking the 
doors or posting an assistant outside each door may be necessary to prevent the 
entrance of latecomers. 

It is important to realize the extent to which testing conditions may influence 
scores. Even apparently minor aspects of the testing situation may appreciably 
alter performance. Such a condition as the use of desks or of chairs with desk arms, 
for example, proved to be significant in a group testing project with high school 
students, the groups using desks tending to obtain higher scores (T. L. Kelley, 
1943; Traxler & Hilkert, 1942). There is also evidence to show that the type of 
answer sheet employed may affect test scores (E. O. Bell, Hoff, & Hoyt, 1964). Be- 
cause of the establishment of. independent test-scoring and data-processing agen- 
cies that provide their own machine-scorable answer sheets, examiners sometimes 
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administer group tests with answer sheets other than those used in the standard- 
ization sample. In the absence of empirical verification, the equivalence of these 
answer sheets cannot be assumed. In testing children below the fifth grade, the 
use of any separate answer sheet may significantly lower test scores (Cashen & 
Ramseyer, 1969; Ramseyer & Cashen, 1971). At these grade levels, having the 
child mark the answers in the test booklet itself is generally preferable. 

Even more significant at any age level are the possible differences between 
paper-and-pencil and computer administration of the same tests. Considerable at- 
tention has been devoted to the effect of such a difference in test administration 
on norms, reliability, and validity, in relation to the nature of the test and the 
population of test takers. Special professional guidelines have been formulated to 
aid test users in assessing the comparability of test scores obtained under the two 
types of administration (Butcher, 1987; Hofer & Green, 1985). 

Many other, more subtle testing conditions have been shown to affect perfor- 
mance on ability as well as on personality tests. Whether the examiner is a 
stranger or someone familiar to the test takers may make a significant difference 
in test scores (Sacks, 1952; Tsudzuki, Hata, & Kuze, 1957). In another study, the 
general manner and behavior of the examiner, as illustrated by smiling, nodding, 
and making such comments as “good,” or “fine,” were shown to have a decided ef. 
fect on test results (Wickes, 1956). In a projective test requiring the respondent 
to write stories to fit given pictures, the presence of the examiner in the room 
tended to inhibit the inclusion of strongly emotional content in the stories (Bern- 
stein, 1956). In the administration of a typing test, job applicants typed at a sig- 
nificantly faster rate when tested alone than when tested in groups of two or more 
(Kirchner, 1966). 

Examples could readily be multiplied. The implications are threefold. First, fol- 
low standardized procedures to the most minute detail. It is the responsibility of 
the test author and publisher to describe such procedures fully and clearly in the 
test manual. Second, record any unusual testing conditions, however minor. 
Third, take testing conditions into account when interpreting test results. In the 
extensive assessment of a person through individual testing, an experienced ex- 
aminer may occasionally depart from the standardized test procedure in order to 
elicit additional information for special reasons. In such a case, the test results can 
no longer be interpreted in terms of the test norms. Under these circumstances, 
the test stimuli are used only for qualitative exploration; and the responses should 
be treated in the same way as any other informal behavioral observations or inter- 
view data. 


Introducing the Test: Rapport and Test-Taker Orientation. In test ad- 
ministration, "rapport" refers to the examiner's efforts to arouse the test takers' 
interest in the test, elicit their cooperation, and encourage them to respond in a 
manner appropriate to the objectives of the test. In ability tests, the objective 
calls for careful concentration on the given tasks and for putting forth one's best 
efforts to perform well. In self-report personality inventories, the objective calls 
for frank and honest responses to questions about one's usual behavior; in certain 
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projective tests, it calls for full reporting of associations evoked by the stimuli, 
without any censoring or editing of content. Still other kinds of tests may require 
other approaches. But in all instances, the examiner endeavors to motivate the 
respondents to follow the instructions as fully and conscientiously as they can. 

The training of examiners covers techniques for the establishment of rapport 
as well as those more directly related to test administration. In establishing 
rapport, as in other testing procedures, uniformity of conditions is essential for 
comparability of results. If a child is given a coveted prize whenever she solves a 
test problem correctly, her performance cannot be directly compared with the 
norms or with that of other children who are motivated only with the standard 
verbal encouragement or praise. Any deviation from standard motivating condi- 
tions for a particular test should be noted and taken into account in interpreting 
performance. 

Although rapport can be more fully established in individual testing, steps can 
also be taken in group testing to motivate test takers and relieve their anxiety. 
Specific techniques for establishing rapport vary with the nature of the test and 
with the age and other characteristics of the persons tested. In testing preschool 
children, special factors to be considered include shyness with strangers, dis- 
tractibility, and negativism. A friendly, cheerful, and relaxed manner on the part 
of the examiner helps to reassure the child. The shy, timid child needs more pre- 
liminary time to become familiar with the surroundings. For this reason, it is bet- 
ter for the examiner not to be too demonstrative at the outset, but rather to wait 
until the child is ready to make the first contact. Test periods should be brief, and 
the tasks should be varied and intrinsically interesting to the child. The testing 
should be presented to the child as a game and his curiosity aroused before each 
new task is introduced. A certain flexibility of procedure is necessary at this age 
level because of possible refusals, loss of interest, and other manifestations of 
negativism. 

Children in the first two or three grades of elementary school present many of 
the same testing problems as the preschool child. The game approach is still the 
most effective way of arousing interest in the test. The older schoolchild can usu- 
ally be motivated through an appeal to the competitive spirit and the desire to do 
well on. tests. When testing children from educationally disadvantaged back- 
grounds or from different cultures, however, the examiner cannot assume that 
they will be motivated to excel on academic tasks to the same extent as children 
in the standardization sample. This problem and others pertaining to the testing 
of persons with dissimilar experiential backgrounds are considered further in 
chapters 9, 12, and 18. 

Special motivational problems may be encountered in testing emotionally dis- 
turbed persons, prisoners, or juvenile delinquents. Especially when examined in 
an institutional setting, such persons are likely to manifest a number of unfavor- 
able attitudes, such as suspicion, insecurity, fear, or cynical indifference. Special 
conditions in their past experiences are also likely to influence their test perfor- 
mance adversely. As a result of early failures and frustrations in school, for exam- 
ple, they may have developed feelings of hostility and inferiority toward academic 
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tasks, which the tests resemble. The experienced examiner makes special efforts 
to establish rapport under these conditions. In any event, he or she must be sensi- 
tive to these special difficulties and take them into account in interpreting and 
explaining test performance. 

In testing any school-age child or adult, one should bear in mind that every 
test presents an implied threat to the individual's prestige. Some reassurance 
should therefore be given at the outset. It is helpful to explain, for example, that 
no one is expected to finish or to get all the items correct. The test taker might 
otherwise experience a mounting sense of failure as he or she advances to the 
more difficult items or is unable to finish any subtest within the time allowed. 

It is also desirable to eliminate the element of surprise from the test situation as 
far as possible, because the unexpected and unknown are likely to produce anxi- 
ety. Many group tests include a preliminary explanatory statement that is read to 
the group by the examiner. An even better procedure is to provide each test taker 
in advance with materials that explain the purpose and nature of the tests, offer 
general suggestions on how to take tests, and contain a few sample items. Such 
explanatory booklets are regularly available to participants in many large-scale 
testing programs, such as those conducted by the College Board. 

Adult testing presents some additional problems. Unlike the schoolchild, the 
adult is not so likely to work hard at a task merely because it is assigned. It there- 
fore becomes more important to “sell” the purpose of the tests to the adult, al- 
though high school and college students also respond to such an appeal. 
Cooperation of test takers can usually be secured by convincing them that it is in 
their own interests to obtain a valid score—that is, a score correctly indicating 
what they can do rather than overestimating or underestimating their abilities. 
Most persons will understand that an incorrect decision, which might result from 
invalid test scores, would mean subsequent failure, loss of time, and frustration for 
them. This approach can serve not only to motivate test takers to try their best on 
ability tests but also to reduce faking and encourage frank reporting on personal- 
ity inventories, because respondents realize that they themselves would otherwise 
be the losers. It is certainly not in the best interests of individuals to be admitted 
to a course of study for which they lack the prerequisite skills and knowledge or 
assigned to a job they cannot perform or would find uncongenial. 


EXAMINER AND SITUATIONAL VARIABLES 


Comprehensive surveys of the effects of examiner and situational variables on test 
scores have been published periodically (Lutey & Copeland, 1982; Masling, 
1960; S. B. Sarason, 1954; Sattler, 1970, 1988; Sattler & Theye, 1967). Although 
some effects have been demonstrated with objective group tests, most of the data 
have been obtained with either projective techniques or individual intelligence 
tests. These extraneous factors are more likely to operate with unstructured and 
ambiguous stimuli, as well as with difficult and novel tasks, than with clearly 
defined and well-learned functions. In general, children are more susceptible to 
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examiner and situational influences than are adults; in the examination of pre- 
school children, the role of the examiner is especially crucial. Emotionally dis- 
turbed and insecure persons of any age are also more likely to be affected by such 
conditions than are well-adjusted persons. 

Performance on individually administered intelligence tests and on projective 
techniques has been investigated in relation to many examiner variables, includ- 
ing age, sex, ethnicity, professional or socioeconomic status, training and experi- 
ence, personality characteristics, and appearance. Although several significant 
relationships have been found, the results are often inconclusive or misleading 
because the experimental design failed to control or isolate the influence of differ- 
ent examiner or examinee characteristics. Hence the effect of two or more vari- 
ables may have been confounded. 

That test results may be influenced by the examiner's behavior immediately 
preceding and during test administration has been more clearly demonstrated. For 
example, controlled investigations have yielded significant differences in intelli- 
gence test performance as a result of a “warm” versus a “cold” interpersonal rela- 
tion between examiner and examinees, or a rigid and aloof versus a natural 
manner on the part of the examiner (Exner, 1966; Masling, 1959). Moreover, 
there may be significant interactions between examiner and examinee character- 
istics, in the sense that the same examiner characteristics or testing manner may 
have a different effect on different examinees as a function of the examinee's own 
personality characteristics. Similar interactions may occur with task variables, 
such as the nature of the test, the purpose of the testing, and the instructions 
given to test takers. Dyer (1973) added even more variables to this list, calling at- 
tention to the possible influence of the test givers’ and the test takers’ diverse per- 
ceptions of the functions and goals of testing. 

Still another way in which an examiner may inadvertently affect the test 
taker's responses is through the examiner’s own expectations. This is simply a spe- 
cial instance of the self-fulfilling prophecy (Harris & Rosenthal, 1985; R. Rosen- 
thal, 1966; R. Rosenthal & Rosnow, 1969). An experiment conducted with the 
Rorschach will illustrate this effect (Masling, 1965). The examiners were 14 grad- 
uate student volunteers, 7 of whom were told, among other things, that experi- 
enced examiners elicit more human than animal tesponses, while the other 7 
were told that experienced examiners elicit more animal than human responses. 
Under these conditions, the two groups of examiners obtained significantly differ- 
ent ratios of animal to human responses from their examinees. These differences 
occurred despite the fact that neither examiners nor test takers reported aware- 
ness of any influence attempt. Moreover, tape recordings of all testing sessions re- 
vealed no evidence of verbal influence on the part of any examiner. The 
examiners’ expectations apparently operated through subtle postural and facial 
cues to which the test takers responded. 

Apart from the examiner, other aspects of the testing situation may signifi- 
cantly affect test performance. Military recruits, for example, have often been ex- 
amined shortly after induction, during a period of intense readjustment to an 
unfamiliar and stressful situation. In one investigation designed to test the effect 
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of acclimatization to such a situation-on test performance, 2,724 recruits were 
given the Navy Classification Battery during their ninth day at the Naval Train- 
ing Center (L. V. Gordon & Alf, 1960). When their scores were compared with 
those obtained by 2,180 recruits tested at the conventional time, during their 
third day, the 9-day group scored significantly higher on all subtests of the battery. 

The test takers' activities immediately preceding the test may also affect their 
performance, especially when such activities produce emotional disturbance, fa- 
tigue, or other handicapping conditions. In an investigation with third- and 
fourth-grade schoolchildren, there was some evidence to suggest that IQ on the 
Draw-a-Man Test was influenced by the children's preceding classroom activity 
(McCarthy, 1944). On one occasion, the students had been engaged in writing a 
composition on “The Best Thing That Ever Happened to Me”; on the second oc- 
casion, they had again been writing, but this time on "The Worst Thing That 
Ever Happened to Me." The IQs on the second test, following what may have 
been an emotionally depressing experience, averaged 4 or 5 points lower than on 
the first test. These findings were corroborated in a later investigation specifically 
designed to determine the effect of immediately preceding experience on the 
Draw-a-Man Test (Reichenberg-Hackett, 1953). In this study, children who had 
had a gratifying experience involving the successful solution of an interesting puz- 
zle, followed by a reward of toys and candy, showed more improvement in their 
test scores than those who had undergone neutral or less gratifying experiences. 
Similar results were obtained by W. E. Davis (1969a, 1969b) with college stu- 
dents. Performance on an arithmetic reasoning test was significantly poorer when 
preceded by a failure experience on a verbal comprehension test than it was in a 
control group given no preceding test and in one that had taken a standard verbal 
comprehension test under ordinary conditions. 

Several studies have been concerned with the effects of feedback regarding test 
scores on the individual's subsequent test’ performance: In a particularly well- 
designed investigation with seventh-grade students, Bridgeman (1974) found 
that “success” feedback was followed by significantly higher performance on a 
similar test than was “failure” feedback in students who had actually performed 
equally well to begin with. This type of motivational feedback may operate 
largely through the goals the participants set for themselves in subsequent perfor- 
mance and may thus represent another example of the self-fulfilling prophecy. 
Such general motivational feedback, however, should not be confused with cor- 
tective feedback, whereby the individual is informed about the specific items he 
or she missed and given remedial instruction; under these conditions, feedback is 
much more likely to improve the performance of initially low-scoring persons. 

The examples cited in this section illustrate the wide diversity of test-related 
variables that may affect test scores. In the majority of well-administered testing 
programs, the influence of these variables is negligible for practical purposes. Nev- 
ertheless, the skilled examiner is constantly on guard to detect the possible opera- 
tion of such variables and to minimize their influence. When circumstances do 
not permit the control of these conditions, the conclusions drawn from test per- 
formance should be qualified. 


34 FUNCTIONS AND ORIGINS OF PSYCHOLOGICAL TESTING 


A VIEW FROM THE TEST-TAKER'S PERSPECTIVE 


Test Anxiety. Among the earliest studies of the reactions of examinees to the 
testing situation are those dealing with test anxiety. Undoubtedly, early interest 
in this type of response was stimulated by its ready visibility and its clearly 
apparent detrimental effects on performance. In test administration, many of the 
practices designed to enhance rapport serve also to reduce test anxiety. 
Procedures tending to dispel surprise and strangeness from the testing situation 
and to reassure and encourage the test taker should certainly help to lower 
anxiety. The examiner’s own manner and a well-organized, smoothly running 
testing operation will contribute toward the same goal. 

Individual differences in test anxiety have been studied with both schoolchil- 
dren and college students (Gaudry & Spielberger, 1974; Hagtvet & Johnsen, 
1992; I. G, Sarason, 1980; Spielberger, 1972). Much of this research was initiated 
by S. B. Sarason and his associates at Yale (Sarason, Davidson, Lighthall, Waite, 
& Ruebush, 1960). The first step was to construct a questionnaire to assess the in- 
dividual’s test-taking attitudes. The children’s form, for example, contained items 
such as the following: 


Do you worry a lot before taking a test? 

When the teacher says she is going to find out how much you have learned, 
does your heart begin to beat faster? 

While you are taking a test, do you usually think you are not doing well? 


Of primary interest is the finding that both school achievement and intelligence 
test scores yielded significant negative correlations with test anxiety. Similar cor- 
relations have been found among college students (I. G. Sarason, 1961). Longitu- 
dinal studies likewise revealed an inverse relation between changes in anxiety 
level and changes in intelligence or achievement test performance (K. T. Hill & 
S. B. Sarason, 1966; S. B. Sarason, K. T. Hill, & Zimbardo, 1964). 

Such findings, of course, do not indicate the direction of causal relationships. It 
is possible that students develop test anxiety because they perform poorly on tests 
and have thus experienced failure and frustration in previous test situations. In 
support of this interpretation is the finding that within subgroups of high scorers 
on intelligence tests, the negative correlation between anxiety level and test per- 
formance disappeared (Denny, 1966; Feldhusen & Klausmeier, 1962). On the 
other hand, there is evidence suggesting that at least some of the relationship re- 
sults from the deleterious effects of anxiety on test performance. In one study 
(Waite, Sarason; Lighthall, & Davidson, 1958), high-anxious and low-anxious 
children equated in intelligence test scores were given repeated trials in a learn- 
ing task. Although initially equal in the learning test, the low-anxious group im- 
proved significantly more than the high-anxious. 

Several investigators have compared test performance under conditions de- 
signed to evoke "anxious" and "relaxed" states. Mandler and Sarason (1952), for 
example, found that ego-involving instructions, such as telling test takers that 
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everyone is expected to finish in the time allotted, had a beneficial effect 
on the performance of low-anxious persons, but a deleterious effect on that of 
high-anxious persons. Other studies have likewise found an interaction between 
testing conditions and such individual characteristics as anxiety level and 
achievement motivation (Lawrence, 1962; Paul & Eriksen; 1964). It appears 
likely that the relation between anxiety and test performance is nonlinear, a 
slight amount of anxiety being beneficial while a large amount is detrimental. In- 
dividuals who are customarily low-anxious benefit from test conditions that 
arouse some anxiety, while those who are customarily high-anxious perform bet- 
ter under more relaxed conditions. 

It is undoubtedly true that a chronically high anxiety level exerts a detrimen- 
tal effect on school learning and intellectual development. Anxiety interferes 
with both the acquisition and the retrieval of information (Hagtvet & Johnsen, 
1992). Such an effect, however, should be distinguished from the test-limited ef- 
fects with which this discussion is concerned. To what extent does test anxiety 
make the individual’s test performance unrepresentative of his or her customary 
performance in nontest situations. Because of the competitive pressure experi- 
enced by college-bound high school seniors, it has been argued that performance 
on college admission tests may be unduly affected by test anxiety. In a thorough 
and well-controlled investigation of this question, French (1962) compared the 
performance of high school students on a test given as part of the regular admin- 
istration of the Scholastic Aptitude Test with performance on a parallel form of 
the test administered at a different time under "relaxed" conditions. The instruc- 
tions on the latter occasion specified that the test was given for research purposes 
only and scores would not be sent to any college. The results showed that perfor- 
mance was no poorer during the standard administration than during the relaxed 
administration. Moreover, the concurrent validity of the test scores against high 
school course grades did not differ significantly under the two conditions. Sev- 
eral recent investigations have also called into question the common stereotype 
of the test-anxious student who knows the subject matter but "freezes up" when 
taking a test (see Culler & Holahan, 1980). This research found that students 
who score high on a test anxiety scale obtain lower grade-point averages and 
tend to have poorer study habits than do those who score low in test anxiety. 

Research on the nature, measurement, and treatment of test anxiety has been 
continuing at an ever-increasing pace (I. G. Sarason, 1980; Spielberger, Anton, 
& Bedell, 1976; Spielberger, Gonzalez, & Fletcher, 1979; Spielberger, Gonzalez, 
Taylor, Algaze, & Anton, 1978; G. S. Tryon, 1980). With regard to the nature of 
test anxiety, two important components have been identified, namely, emotional- 
ity and worry. The emotionality component comprises feelings and physiological 
reaction, such as tension and increasing heartbeat. The worry or cognitive com- 
ponent includes negative self-oriented thoughts, such as expectation of doing 
poorly and concern about the consequences of failure. These thoughts draw at- 
tention away from the task-oriented behavior required by the test and thereby 
disrupt performance. Both components are measured by several test anxiety in- 
ventories. Although widely used in research, such inventories have until recently 
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been available only through reports in the research literature. The Test Anxiety 
Inventory developed by Spielberger and his co-workers is an example of a pub- 
lished test; it is described in chapter 13 and listed in Appendix.A. 

Considerable effort has been devoted to the development and evaluation of 
methods for treating test anxiety. These include several behavior therapy proce- 
dures (chap. 17) for reducing the emotional component of rest anxiety. The re- 
sults have generally been positive, but it is difficult to attribute the improvement 
to any particular technique because of methodological flaws in the evaluation 
studies (G. S. Tryon, 1980). In fact, the emotionality component of test anxiety 
tends to decrease from test to retest even in control groups with no thetapeutic 
intervention, as well as in special control groups given a credible pseudotherapy. 
Moreover, reduction in the emotionally component has little or no effect on per- 
formance level. 

Performance in both tests and course work is more likely to improve when 
treatment is directed to the self-oriented cognitive reactions. Available research 
thus far suggests that the best results are obtained from combined treatment pro- 
grams, which include the elimination of emotionality and worry, as well as the 
improvement of study skills. Test anxiety is a complex phenomenon with multi- 
ple causes, and the relative contribution of different causes varies with the indi- 
vidual. To be effective, treatment programs should be adapted to individual needs. 
It must also be recognized that test anxiety is only one manifestation of a more 
general set of conditions that reduce the individual's effectiveness as a learner. 


Comprehensive Investigation of ‘Test-Taker Views. Although test anxi- 
ety is a conspicuous and important aspect of test-taker behavior, there is much 
more that can profitably be explored. A 1993 book edited by Baruch Nevo and 
R. S. Jager provides a broad-based effort to gather available information on 
examinees’ reactions to testing in educational, industrial, clinical, and counseling 
settings. The fifteen chapters by established researchers on various aspects and 
applications of testing report available multinational publications on each topic, 
as well as the findings of the authors’ own research. The result is a serious, data- 
based attempt to answer questions heretofore treated primarily in journalistic, 
political, or legalistic sources. The book thus serves as a corrective for possibly 
biased and conflicting opinions about testing that abound today. For example, the 
first chapter draws upon ten well-conducted opinion surveys of attitudes toward 
testing in samples of a wide range of populations. The results reveal discrepancies 
between the publics views and some of the claims of widely heard bur 
unrepresentative spokespersons on testing issues. 

The individual chapters span a diversity of topics. Some describe the develop- 
ment and use of feedback questionnaires and group interviews for assessing atti- 
tudes toward a given test and perceptions of what the test measures, as expressed 
by different groups of test takers. One chapter compared students’ opinions of 
essay and multiple-choice classroom tests, with results that strongly favored the 
multiple-choice format. Some authors deal with the reactions of job applicants to 
fairness and job-relatedness of tests. Several chapters suggest ways of improving 
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test administration and the testing environment, as a result of their findings. On 
the whole, the various chapters open up a promising area to explore in seeking so- 
lutions to some of the current social and practical problems in testing. They also 
provide ways of enhancing mutual understanding between test users and test takers. 


EFFECTS OF TRAINING ON TEST PERFORMANCE 


In evaluating the effects of training or practice on test scores, a fundamental ques- 
tion is whether the improvement is limited to the specific items included in the 
test or whether it extends to the broader behavior domain that-the test is de- 
signed to assess (Anastasi, 1981a, 1981b). The answer to this question represents 
the difference between coaching and education. Obviously, any educational ex- 
perience the individual undergoes, either formal or informal, in or out of school, 
should be reflected in her or his performance on tests that sample the relevant as- 
pects of behavior. Such broad influences will in no way invalidate the test, insofar 
as the test score presents an accurate picture of the individual's standing in the 
abilities under consideration. The difference is, of course, one of degree. Influ- 
ences cannot be classified as either narrow or broad but obviously vary widely in 
scope; from those affecting only-a single administration of a single test, through 
those affecting performance op all items of a certain type, to those influencing the 
individual's performance in the large majority of activities. From the standpoint 
of effective testing, however, a workable distinction can be made: Thus, it can be 
stated hat à test score is invalidated only when a particular experience raises the 
score without appreciably affecting the behavior domain that the test is designed 
to measure. 


Coaching. The effects. of coaching on test scores have been widely investi- 
gated. Several early studies were conducted by British psychologists, with special 
reference to the effects of practice and coaching on the tests formerly used in 
assigning 11-year-old children to different types of secondary schools (Yates et al., 
1953-1954). As might be expected, the extent of improvement depends on the 
ability and earlier educational experiences of the test takers, the nature of the 
tests, and the amount and type of coaching provided: Individuals with deficient 
educational backgrounds are more likely to benefit from special coaching than are 
those who have had superior educational opportunities and are already prepared 
to do well on the tests. It is obvious, too, that the closer the resemblance between 
test content and coaching material, the greater will be the improvement in test 
scores: On the other hand} the more closely instruction is restricted to specific 
test content, the less likely is improvement to extend to criterion performance: 
Many coaching studies, moreover, yield ambiguous and uninterpretable results 
because of serious methodological shortcomings (Anastasi, 1981a; Bond, 1989; 
Messick, 19802). Chief among these shortcomings is the failure to employ a 
noncoached control-group ‘that is truly comparable to the coached group. 
Students who enroll in commercial coaching programs, for example, are self- 
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selected and tend to differ from those in the control group in initial ability, 
motivation, and other personal characteristics that affect test performance. In 
experimental designs that employ pretests and posttests, furthermore, it is difficult 
to ensure that test takers are equally motivated to do well on both tests; this is 
especially true if one test is given in a regular administration, the other in a 
special administration for Practice or for research purposes only. 

The College Entrance Examination Board has been concerned about the 
spread of ill-advised commercial coaching courses for college applicants. To clar- 
ify the issue, the College Board has conducted several well-controlled experi- 
ments to determine the effect of such coaching on the Scholastic Aptitude Test 
(SAT) and has surveyed the results of similar studies by other, independent inves- 
tigators (Donlon, 1984; Messick, 1980a, 1981; Messick & Jungeblut, 1981). 
These studies covered a variety of coaching methods and included students in 


It should also be noted that in their test construction procedures, such organi- 
zations as the College Board and the Graduate Record Examination Board inves- 

. tigate the susceptibility of new item types to coaching (Evans & Pike, 1973; 
Powers, 1983; Powers & Swinton, 1984; Swinton & Powers, 1985). Item types on 
which performance can be appreciably raised by short-term drill or instruction of 


obvious example would be the type of problem that requires a simple, insightful 
solution, that, once attained, can be applied directly to solving all similar prob- 


structions (Powers, 1986). 

Coaching, in the narrow, traditional sense, is designed to develop highly re- 
stricted skills that may be of little use in life activities. Similarly, the practice of 
"teaching to the test" tends to Concentrate on the particular sample of skills and 
knowledge covered by the test, rather than on the broader knowledge domain 
that the test tries to assess. The so-called “truth in testing,” or disclosure, laws 
that require the general release of test forms after a single administration also en- 
courage a focus on test-specific skills of limited applicability. Finally, insofar as 
coaching may be available to some test takers and not to others, it tends to intro- 
duce individual differences in narrowly defined test-taking skills, thereby reduc- 
ing the diagnostic value of the test. 


Test Sophistication. The effects of test sophistication, or sheer test-taking 
practice, are also relevant in this connection. In studies with alternate forms of 
the same test, there is a tendency for the second score to be higher. Significant 
mean gains have been reported when alternate forms were administered in 
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immediate succession or after intervals ranging from one-day to three years 
(Donlon, 1984; Droege, 1966; Peel, 1951, 1952). Similar results have been 
obtained with normal and intellectually gifted schoolchildren, high school and 
college students, and employee samples. Data on the distribution of gains to be 
expected on a retest with a parallel form should be provided in test manuals, and 
allowance for such gains should be made when interpreting test scores. 

Nor are score gains limited to alternate forms. The individual who has had 
extensive prior experience in taking standardized tests enjoys a certain advantage 
in test performance over one who is taking her or his first test (Millman, Bishop, 
& Ebel, 1965; Rodger, 1936). Part of this advantage stems from having overcome 
an initial feeling of strangeness, as well as from having developed more self- 
confidence and better test-taking attitudes. Part is the result of a certain amount 
of overlap in the type of content and functions covered by many tests. Specific fa- 
miliarity with common item types and practice in the use of objective answer 
sheets may also improve performance slightly. It'is particularly important to take 
test sophistication into account when comparing the scores obtained by persons 
whose test-taking experience may have varied widely. For computer-administered 
testing, attention should be given to the test taker's familiarity with this form of 
test administration (Hofer & Green, 1985). 

Short orientation and practice sessions can be quite effective in equalizing test 
sophistication (Anastasi, 1981a; Wahlstrom & Boersman, 1968). Such familiar- 
ization training reduces the effects of prior differences in test-taking experience as 
such: Since these individual differences are specific to the test situation, their 
diminution should permit a more valid assessment of the broad behavior domain 
the test is designed to measure. This approach is illustrated by the College Board 
publication entitled Taking the SAT I: Reasoning Test, a booklet distributed to all 
college applicants who register for this test. The booklet offers suggestions for ef- 
fective test-taking behavior, illustrates and explains the different types of items 
included in the test, and reproduces a complete form of the test, which students 
are advised to take under standard timing conditions and to score with the given 
key. A similar booklet, Taking the SAT II: Subject Tests, illustrates and explains 
items from different subject-matter tests: 

The Graduate Record Examinations (GRE) also provide test familiarization 
materials. The Information Bulletin distributed to all applicants includes an expla- 
nation of sample items from the General Test, as well as a complete previously ad- 
ministered test with its scoring key. Additional test forms are published regularly 
for this test in book form (Practicing to Take the GRE General Test). Similar prac- 
tice booklets are available that provide individual GRE tests in several subject- 
matter areas. 

The upsurge of test familiarization materials appearing in the 1980s and 1990s 
is not limited to the printed medium but includes transparencies, slides, films, 
videocassettes, and computer software. Many of these materials have been devel- 
oped and are distributed by Educational Testing Service. A few are designed for 
use with specific tests, as illustrated by a slide show to accompany Taking the SAT 
and others on the interpretation of scores on the SAT and on the College Board 
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achievement tests. A computer program to aid in understanding SAT scores is 
also available. A relatively elaborate test-preparation software package has been 
developed for students planning to take the GRE General Test. By means of an 
interactive computer program, this package provides sample items; a simulated, 
timed. testing situation; explanations for incorrectly answered items; and an 
analysis of the test taker's strengths and weaknesses. 

Other materials (print, films, multimedia packages, computer software) are de- 
signed for more general test-taking orientation, spanning a range from primary- 
school children to adults. An example is On Your Own: Preparing. for. a 
Standardized Test (1987), a video disk for use by high school students individually 
or in groups. A simple, comprehensive guide in book form is How.to Take a Test: 
Doing Your Best (Dobbin, 1984). Test orientation aids have also been prepared by 
a few major commercial test publishers and by government agencies. The latter is 
illustrated by the set of materials for use with the General Aptitude Test Battery 

: (GATB) of the U.S. Employment Service. 


Instruction in: Broad Cognitive Skills. Some researchers have been ex- 
ploring the opposite approach to the improvement of test performance. Their 
goal is the development of widely applicable intellectual skills, work habits, and 
problem-solving strategies. The effects of such interventions should. be mani- 
fested in both test scores and criterion performance, such as college courses. In 
accordance with the distinction introduced at the opening of this section, this 
type of program is designed to provide education rather than coaching. Some of 
these investigators have been working with educable mentally retarded children 
and adolescents (Babad & Budoff, 1974; Belmont & Butterfield, 1977; A. L. 
Brown, 1974; Budoff & Corman, 1974; Campione & Brown, 1979, 1987; 
Feuerstein, 1979, 1980; Feuerstein, Rand, Jensen, Kaniel, & Tzuriel, 1987). 
Others have concentrated on college and professional school students from 
educationally disadvantaged backgrounds (Linden & Whimbey, 1990; Whimbey, 
1975, 1977,.1980)... . 1 

Many: of the training procedures employed in these programs are designed to 
develop effective problem-solving behavior, such as careful analysis of problems or 
questions; consideration of all alternatives, relevant details, and implications in 
arriving at a solution; deliberate rather than impulsive formulation or choice of a 
solution; and the application of high standards in evaluating one's own perfor- 
mance: These are obviously strategies that should improve one’s intellectual func- 
tioning not only on tests but also in academic work and.in many other 
everyday-life activities that depend on school learning. A crucial question, how- 
ever, pertains to the degree of transfer and generalizability of effects beyond the 
types of contents and settings employed in the training. Results thus far reported 
are promising. But the programs are still in an exploratory stage, and more research 
is needed to establish the breadth and durability of the improvement attained. 


Overview. ` We have considered three types of pretest training that are quite 
dissimilar in their objectives. How do these types of training affect the validity of 
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a.test and its practical usefulness as an assessment instrument? The first. was 

coaching, in the sense of intensive, massed drill on items similar to those op the 

test. It was noted that well-constructed tests choose item types so as to minimize 

their susceptibility to such drill; and they also protect the security of specific test’ 
items. Insofar as such coaching might improve test performance, it would do so 

without a corresponding improvement in criterion behavior. Hence, it would 

thereby reduce test validity. The test would become a less effective measure of the 

broad abilities it was designed to assess and.a less accurate means of ascertaining 

whether the individual. has developed the skills and knowledge prerequisite for 

success in the criterion situation. 

Test orientation procedures, on the other hand, are designed to rule out or 
equalize differences in prior test-taking experience. Like the effects of coaching, 
these differences represent conditions that affect test scores as such, without nec- 
essarily being reflected in the broader behavior domain to be assessed. Hence, the 
test orientation procedures should make the test a more valid instrument by re- 
ducing the influence of test-specific factors. 

Finally, training in broadly applicable cognitive skills, if effective, should im- 
prove the trainee’s ability to cope with subsequent intellectual tasks. This im- 
provement will and should be reflected in test performance. Insofar as both test 
scores and criterion performance are improved, such training leaves test validity 
unchanged; but it enhances the individual’s chances of attaining desired goals, 


SOURCES OF INFORMATION ABOUT TESTS 


Psychological testing is in a state of rapid change. There are shifting orientations, 
a constant stream of new tests, revised forms of old tests, and additional data that 
may refine or alter the interpretation of scores on existing tests. The accelerating 
rate of change, together with the vast number of available tests, makes it imprac- 
ticable to survey specific tests in any single text. More intensive coverage of test- 
ing instruments and problems in special areas can be found in books dealing with 
the use of tests in such fields as counseling, clinical practice, personnel selection, 
and education. References to such publications are given in the appropriate chap- 
ters of this book. In order to keep abreast of current developments, however, any- 
one working with tests needs to be familiar with more direct sources of 
information about tests. 

One of the most important sources is the Mental Measurements Yearbook 
(MMY) established by Oscar K.. Buros and edited by him through 1978. Since 
1985, the MMY has been published by the Buros Institute of Mental Measure- 
ments at the University of Nebraska. This series of yearbooks covers neatly all 
commercially available psychological, educational, and vocational tests published 
in English. The coverage is especially complete for paper-and-pencil tests. Each 
yearbook includes tests published during a specified period, thus supplementing 
rather than supplanting the earlier yearbooks. The earliest publications in this se- 
ties were merely bibliographies of tests. Beginning in 1938, the yearbook assumed 
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its current form, which includes critical reviews of the tests by one or more test 
experts, as well as a complete list of published references pertaining to each test. 
Routine information regarding publisher, price, forms, and age of persons for 
whom the test is suitable is also regularly given. The current plan is to publish a 
new MMY every two or three years, with a supplement published between year- 
books. 

The test entries of the MMY, together with the critical reviews, are now avail- 
able electronically through SilverPlatter (see Appendix B). The database begins 
with the entries in the ninth MMY and is updated every six months. Another 
publication of the Buros Institute is Tests in Print, now in its fourth volume 
(TIP-IV, 1994), edited by L. L. Murphy, Conoley, and Impara. This publication 
provides cumulative coverage of all known commercially published tests in Eng- 
lish, together with factual information and reference lists. Each successive edition 
-of Tests in Print can also be used as an index to all the MMYs that preceded it. 

Another major source of information about published tests is the Test Collec- 
tion Bibliographies prepared by Educational Testing Service (ETS). Annotated test 
bibliographies are available in specific content areas. Coverage is comprehensive, 
including all types of tests, as well as tests designed for particular uses and for spe- 
cial populations, such as physically disabled persons. Each test entry provides fac- 
tual information, including author, publication date, publisher, target population, 
purpose of the test, and any subscores or variables measured. Test bibliographies 
for particular areas can be purchased for a nominal fee from Test Collection, ETS 
(address in Appendix B). This is one of several ETS publications providing cur- 
rent information on tests and testing. 

Besides published tests, there is a vast store of unpublished tests that are de- 
scribed or reproduced in books, journals, or unpublished reports. Of interest 
chiefly to research workers, such tests have been surveyed in various compendi- 
ums (e.g., Goldman & Mitchell, 1995). Current information about unpublished 
tests is available through Tests in Microfiche, distributed by Test Collections, ETS. 
A new set of these tests is added each year, and an index of each set can be ob- 
tained. Individual tests or sets may be purchased by qualified users. A clear and 
concise guide for finding information on both published and unpublished tests is 
available from the Science Directorate of the American Psychological Associa- 
tion (Finding Information, 1995). This source is regularly updated; anyone request- 
ing a copy will automatically receive the latest version. 

For the test user, the most direct source of information about particular current 
tests is provided by the catalogs of test publishers and by the manual that accom- 
panies each test. A comprehensive list of test publishers, with addresses, can be 
found in the latest Mental Measurements Yearbook. For ready reference, the names 
and addresses of publishers whose tests are cited in this book are given in Appen- 
dix B. Catalogs of current tests can be obtained from the major publishers on re- 
quest. Manuals and specimen sets of tests can be purchased by qualified users. 

The test manual should provide the essential information required for adminis- 
tering, scoring, and evaluating a particular test. In it should be found full and de- 
tailed instructions, Scoring key, norms, and data on reliability and validity. 
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Moreover, the manual should report the number and nature of persons on whom 
norms, reliability, and validity were established and the methods employed in 
computing indices of reliability and validity. In the event that the necessary infor- 
mation is too lengthy to fit conveniently into the manual, references to a techni- 
cal manual or other printed sources in which such information can be readily 
located should be given. The manual should, in other words, enable test users to 
evaluate the test before choosing it for their specific purposes. It should be added 
that some test manuals still fall short of this goal. But the larger and more profes- 
sionally oriented test publishers are giving increasing attention to the preparation 
of manuals that meet adequate scientific standards. An enlightened public of test 
users provide the firmest assurance that such standards will be maintained and im- 
proved in the future. 

A succinct but comprehensive guide for the evaluation of psychological tests is 
to be found in the Standards for Educational and Psychological Testing, prepared by 
the American Psychological Association (APA), in collaboration with two other 
associations concerned with testing, the American Educational Research Associ- 
ation (AERA) and the National Council on Measurement in Education 
(NCME). First published in 1954, the Standards were revised in 1966, 1974, and 
1985. A further, extensive revision is in progress by the three participating associ- 
ations. 

The need to establish Testing Standards? that concern not only the technical 
quality of the tests but also the effect of testing on the welfare of the individual 
began to surface in the 1980s (see Fig. 1—1, p. 30). That this concern represents a 
continuing trend is indicated by the nature of the most recent revision of the 
Testing Standards. Figure 1—2 on p. 31 contains a proposed list of Standards pre- 
pared by a committee from the three associations in 1996. Clearly, concern with 
adapting the selection of tests—as well as concern with the interpretation and use 
of their scores—to available knowledge about the test-taker's experiential history 
shows a continuing rise. It is noteworthy that a whole section (Part II) of Figure 
1-2 is entitled "Fairness in Testing." Test users are becoming increasingly aware 
that the improper application of tests can harm the individual and decrease the 
effectiveness of her or his contribution to society. Popular criticisms of the misuse 
of tests may well have contributed to this growing awareness by testers, which 
should decrease such misuses. This in turn should increase public recognition of 
the potential benefits of test use. 


?For brevity, we shall henceforth follow the common practice of identifying them as “Testing Standards" through- 
out this book. 
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STANDARDS FOR EDUCATIONAL AND PSYCHOLOGICAL TESTING 


Part |. Technical Standards for Test Construction and Evaluation 
1. Validity 
2. Reliability and errors of measurement 
3. Test development and revision 
4. Scaling, norming, score comparability, and equating 
5. Test publication: technical manuals and users’ guides 
Part Il. Professional Standards for Test Use 
6. General principles of test use 
7. Clinical testing 
8. Educational testing and psychological testing in the schools 
9. Test use in counseling 
10. Employment testing 
11. Professional and occupational licensure and certification 
12. Program evaluation 
Part Ill. Standards for Particular Applications 
13. Testing linguistic minorities 
14. Testing people who have handicapping conditions 
Part IV. Standards for Administrative Procedures 
15. Test administration, scoring, and reporting 


16. Protecting the rights of test takers 


Figure 1-1. Topics covered by the Standards for Educational and 
Psychological Testing (AERA, APA, NCME, 1985). 
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10. 


11. 


12. 


13. 


14. 


15. 


Figure 


PROPOSED LIST OF STANDARDS FOR EDUCATIONAL 
AND PSYCHOLOGICAL TESTING 


Part |. Test Construction, Evaluation, and Documentation 


. Validity 


. Reliability, errors of measurement, and test score information 


function 


. Test development and revision 


Scaling, norming, standards, and score comparability 


. Test administration, scoring and reporting 


. Test documents 


Part Il. Fairness in Testing 


. Fairness and bias 


Protecting the rights of test takers 


. Testing people whose first language is not English 


Testing individuals with disabilities 

Part Ill. Testing Applications 
General Principles of test use 
Psychological testing and assessment 
Educational testing and assessment 
Employment testing, licensure and certification 


Testing in program evaluation and public policy 


1-2. Topics chosen for revised edition of Standards for Educational 


and Psychological Testing (AERA, APA, NCME, 1996). Manuscript in 
preparation. 

(Reproduced by permission of the Joint Committee on the Standards for Educational and Psycholog- 
ical Testing [Dianne Brown, Project Director].) 
aeee 


Historical Antecedents 
of Modern Testing 


brief overview of the historical antecedents and origins of psychological 
testing will provide perspective and should aid in the understanding of 
present-day tests.! The direction in which contemporary psychological 
testing has been progressing can be clarified when considered in the light of 

the precursors of such tests. The special limitations as well as the advantages that 


particular types of tests, such as ability tests (chaps. 8-12) or interest tests (chap. 
14), as well as areas of test use, such as educational, industrial, clinical, and coun- 


The roots of testing are lost in antiquity. There have been repeated accounts of 
the system of civil service examinations prevailing in the Chinese empire for 
some 2,000 years (Bowman, 1989). Among the ancient Greeks, testing was an es- 
tablished adjunct to the educational Process. Tests were used to assess the mastery 
of physical as well as intellectual skills ( Doyle, 1974). From their beginnings in 
the Middle Ages, European universities relied on formal examinations in award- 


eT hes AT, 
1A more detailed account of the early origins of Psychological tests can be found in E. L. Goodenough (1949) 
and J. Peterson (1 


testing, and Anastasi (1965) for historical antecedents of the study of individual differences. An overview of 
ongoing trends in psychological testing can be found in Anastasi ( 1993). 
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ing degrees and honors. To identify the major developments that shaped conten. 
porary testing, however, we need go no farther than the nineteenth century. It is 
to these developments that we now turn. 


EARLY INTEREST IN CLASSIFICATION AND TRAINING 
OF MENTALLY RETARDED PERSONS 


The nineteenth century witnessed a strong awakening of interest in the humane 
treatment of mentally retarded and insane persons. Prior to that time, neglect, 
ridicule, and even torture had been the common lot of these individuals. With 
the growing concern for the proper care of people with mental problems came a 
realization that some uniform criteria for identifying and classifying these cases 
were required. The establishment of many social institutions for the care of men- 
tally retarded persons in both Europe and the United States made the need for 
setting up admission standards and an objective system of classification espe- 
cially urgent. First it was necessary to differentiate between insane and mentally 
retarded individuals. The former manifested emotional disorders that might or 
might not be accompanied by intellectual deterioration from an initially normal 
level; the latter were characterized essentially by intellectual defect that had 
been present from birth or early infancy. What is probably the first explicit state- 
ment of this distinction is to be found in a two-volume work published in 1838 
by the French physician Esquirol (1838), in which more than 100 pages are de- 
voted to what is now termed “mental retardation.” Esquirol also pointed out that 
there are many degrees of mental retardation, varying along a continuum from 
normality to “low-grade idiocy.” In the effort to develop some system for classify- 
ing the different degrees and varieties of retardation, Esquirol tried several proce- 
dutes but concluded that the individual’s use of language provides the most 
dependable criterion of her or his intellectual level. It is noteworthy that current 
ctiteria of mental retardation are also largely linguistic and that present-day in- 
telligence tests are heavily loaded with verbal content. The important part ver: 
bal ability plays in our concept of intelligence will be repeatedly demonstrated in 
subsequent chapters. 

Of special significance are the contributions of another French physician, 
Seguin, who pioneered in the training of mentally retarded persons. Having re- 
jected the prevalent notion of the incurability of mental retardation, Seguin 
(1866/1907) experimented for many years with what he termed the physiological 
method of training; and in 1837 he established the first school devoted to the ed- 
ucation of mentally retarded children. In 1848 he emigrated to America, where 
his ideas gained wide recognition. Many of the sense-training and muscle-training 
techniques subsequently employed in institutions for mentally retarded individu- 
als were originated by Seguin. By these methods, severely retarded children are 
given intensive exercise in sensory discrimination and in the development of 
motor control. Some of the procedures developed by Seguin for this purpose were 
eventually incorporated into performance or nonverbal tests of intelligence. An 
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later. 


THE FIRST EXPERIMENTAL PSYCHOLOGISTS 


The early experimental psychologists of the nineteenth century were not, in gen- 
eral, concerned with the measurement of individual differences. The principal 


cepted as a necessary evil that limited the applicability of the generalizations. 
Thus, the fact that one individual reacted differently from another when ob- 
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CONTRIBUTIONS OF FRANCIS GALTON 


It was the English biologist Francis Galton who was primarily responsible for 
launching the testing movement. A unifying factor in Galton's numerous and 
varied research activities was his interest in human heredity. In the course of his 
investigations on heredity, Galton realized the need for measuring the character- 
istics of related and unrelated persons. Only in this way could he discover, for ex- 
ample, the exact degree of resemblance between parents and offspring, brothers 
and sisters, cousins, or twins. With this end in view, Galton was instrumental in 
inducing a number of educational institutions to keep systematic anthropometric 
records on their students. He also set up an anthropometric laboratory at the In- 
ternational Exposition of 1884 where, by paying threepence, visitors could be 
measured in certain physical traits and could take tests of keenness of vision and 
hearing, muscular strength, reaction time, and other simple sensorimotor func- 
tions. When the exposition closed, the laboratory was transferred to South Ken- 
sington Museum, London, where it operated for six years. By such methods, the 
first large, systematic body of data on individual differences in simple psychologi- 
cal processes was gradually accumulated. 

Galton himself devised most of the simple tests administered at his anthropo- 
metric laboratory, many of which are still familiar either in their original or in 
modified forms. Examples include the Galton bar for visual discrimination of 
length, the Galton whistle for determining the highest audible pitch, and gradu- 
ated series of weights for measuring kinesthetic discrimination. It was Galton’s 
belief that tests of sensory discrimination could serve as a means of gauging a per- 
son’s intellect. In this respect, he was partly influenced by the theories of Locke. 
Thus, Galton wrote: “The only information that reaches us concerning outward 
events appears to pass through the avenue of our senses; and the more perceptive 
the senses are of difference, the larger is the field upon which our judgment and 
intelligence can act” (Galton, 1883, p. 27). Galton had also noted that persons 
with extreme mental retardation tend to be defective in the ability to discrimi- 
nate among heat, cold, and pain—an observation that further strengthened his 
conviction that sensory discriminative capacity “would on the whole be highest 
among the intellectually ablest” (Galton, 1883, p. 29). 

Galton also pioneered in the application of rating-scale and questionnaire 
methods, as well as in the use of the free association technique subsequently em- 
ployed for a wide variety of purposes. A further contribution of Galton is to be 
found in his development of statistical methods for the analysis of data on individ- 
ual differences, Galton selected and adapted a number of techniques previously de- 
tived by mathematicians. These techniques he put in such form as to permit their 
use by the mathematically untrained investigator who might wish to treat test re- 
sults quantitatively. He thereby extended enormously the application of statistical 
procedures to the analysis of test data. This phase of Galton’s work has been carried 
forward by many of his students, the most eminent of whom was Karl Pearson.” 


?For a fascinating account of the early development of major statistical concepts and the persons responsible for 
them, see Cowles (1989). 
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CATTELL AND THE EARLY “MENTAL TESTS” 


An especially prominent position in the development of psychological testing is 
occupied by the American psychologist James McKeen Cattell. The newly estab- 
lished science of experimental psychology and the still newer testing movement 
merged in Cattell’s work. For his doctorate at Leipzig, he completed a dissertation 
on reaction time, under Wundt’s direction. While lecturing at Cambridge in 
1888, Cattell’s interest in the measurement of individual differences was rein- 
forced by contact with Galton. On his return to America, Cattell was active both 
in the establishment of laboratories for experimental psychology and in the 
spread of the testing movement. 

In an article written by Cattell in 1890, the term “mental test” was used for the 
first time in the psychological literature. This article described a series of tests that 
were being administered annually to college students in the effort to determine 
their intellectual level. The tests, which had to be administered individually, in- 
cluded measures of muscular strength, speed of movement, sensitivity to pain, 
keenness of vision and of hearing, weight discrimination, reaction time, memory, 
and the like. In his choice of tests, Cattell shared Galton’s view that a measure of 
intellectual functions could be obtained through tests of sensory discrimination 
and reaction time. Cattell’s preference for such tests was also bolstered by the fact 
that simple functions could be measured with precision and accuracy, whereas the 
development of objective measures for the more complex functions seemed at 
that time a well-nigh hopeless task. 

Cattell's tests were typical of those to be found in a number of test series devel- 
oped during the last decade of the nineteenth century. Such test series were ad- 
ministered to schoolchildren, college students, and miscellaneous adults. At the 
Columbian Exposition held in Chicago in 1893, Jastrow set up an exhibit at which 
visitors were invited to take tests of sensory, motor, and simple perceptual processes 
and to compare their skill with the norms (J. Peterson, 1926; Philippe, 1894). 
A few attempts to evaluate such early tests yielded very discouraging results. 
The individual’s performance showed little correspondence from one test to 
another (Sharp, 1898-1899; Wissler, 1901), and it exhibited little or no relation 
to independent estimates of intellectual level based on teachers’ ratings 
(T. L. Bolton, 1891-1892; J. A. Gilbert, 1894) or academic grades (Wissler, 1901 ). 

A number of test series assembled by European psychologists of the period 
tended to cover somewhat more complex functions. Kraepelin (1895), who was 
interested primarily in the clinical examination of psychiatric patients, prepared a 
long series of tests to measure what he regarded as basic factors in the characteri- 
zation of an individual. The tests, employing chiefly simple arithmetic operations, 
were designed to measure practice effects, memory, and susceptibility to fatigue 
and to distraction. Another German psychologist, Ebbinghaus (1897), adminis- 
tered tests of arithmetic computation, memory span, and sentence completion to 
schoolchildren. The most complex of the three tests, sentence completion, was 
the only one that showed a clear correspondence with the children’s scholastic 
achievement. 
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In an article published in France in 1895, Binet and Henri criticized most of ` 
the available test series as being too largely sensory and as concentrating unduly 
on simple, specialized abilities. They argued further that, in the measurement of 
the more complex functions, great precision is not necessary, since individual dif- 
ferences are larger in these functions. An extensive and varied list of tests was 
proposed, covering such functions as memory, imagination, attention, compre- 
hension, suggestibility, aesthetic appreciation, and many others. In these tests, we 
can recognize the trends that were eventually to lead to the development of the 
famous Binet intelligence scales. 


BINET AND THE RISE OF INTELLIGENCE TESTS 


Binet and his co-workers devoted many years to active and ingenious research on 
ways of measuring intelligence. Many approaches were tried, including even the 
measurement of cranial, facial, and hand form and the analysis of handwriting, 
The results, however, led to a growing conviction that the direct, even though 
crude, measurement of complex intellectual functions offered the greatest 
promise: Then a specific situation arose that brought Binet’s efforts to immediate 
practical fruition. In 1904, the Minister of Public Instruction appointed Binet to 
the previously cited commission to study procedures for the education of retarded 
children, It was in connection with the objectives of this commission that Binet, 
in collaboration with Simon, prepared the first Binet-Simon Scale (Binet & 
Simon, 1905). 

This scale, known as the 1905 scale, consisted of 30 problems or tests arranged 
in ascending order of difficulty. The difficulty level was determined empirically by 
administering the tests to 50 normal children aged 3 to 11 years, and to some 
mentally retarded children and adults. The tests were designed to cover a wide va- 
riety of functions, with special emphasis on judgment, comprehension, and rea- 
soning, which Binet regarded as essential components of intelligence. Although 
sensory and perceptual tests were included, a much greater proportion of verbal 
content was found in this scale than in most test series of the time. The 1905 
scale was presented as a preliminary and tentative instrument, and no precise ob- 
jective method for arriving at a total score was formulated. 

In the second, or 1908, scale, the number of tests was increased, some unsatis- 
factory tests from the earlier scale were eliminated, and all tests were grouped 
into age levels on the basis of the performance of about 300 normal children be- 
tween the ages of 3 and 13 years. Thus, in the 3-year level were placed all tests 
passed by 80 to 90% of normal 3-year-olds; in the 4-year level, all tests similarly 
passed by normal 4-year-olds; and so on to age 13. The child’s score on the en- 
tire test could then be expressed as a mental level corresponding to the age of nor- 
mal children whose performance he or she equaled. In the various translations 
and adaptations of the Binet scales, the term “mental age” was commonly substi- 
tuted for “mental level.” Since mental age is such a simple concept to grasp, the 
introduction of this term undoubtedly did much to popularize intelligence 
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testing)? Binet himself, however, avoided the term "mental age” because of its 
unverified developmental implications and preferred the more neutral term 
*mental level" (T. H. Wolf, 1973). 

A third revision of the Binet-Simon Scale appeared in 1911, the year of Binet's 
untimely death. In this scale, no fundamental changes were introduced. Minor re- 
visions and relocations of specific tests were instituted. More tests were added at 
several year levels, and the scale was extended to the adult level. 

Even prior to the 1908 revision, the Binet-Simon tests attracted wide attention 
among psychologists throughout the world. Translations and adaptations ap- 
peared in many countries, including several in the United States. The first was by 
H. H. Goddard, then research psychologist at the Vineland Training School (for 
mentally retarded children). The Goddard revision was influential in the accep- 
tance of intelligence testing by the medical profession (Zenderland, 1987). It ar- 
rived at a propitious moment to meet the urgent need for a standardized measure 
to diagnose and classify mentally retarded persons. As a testing instrument, how- 
ever, it was soon outdistanced by the more extensive and psychometrically refined 
Stanford-Binet, developed by L. M. Terman and his associates at Stanford Uni- 
versity (Terman, 1916). It was in this test that the intelligence quotient (IQ), 
or ratio between mental age and chronological age, was first used. The later revi- 
sions of this test have been widely employed and will be more fully considered 
in chapter 8. Of special interest, too, is the first Kuhlmann-Binet revision, which 
extended the scale downward to the age of 3 months (Kuhlmann, 1912). This 
scale represents one of the earliest efforts to develop preschool and infant tests of 
intelligence. 


GROUP TESTING 


The Binet tests, as well as all their revisions, are individual scales in the sense that 
they can be administered to only one person at a time. Many of the tests in these 
scales require oral responses from the examinee or necessitate the manipulation of 
materials. Some call for individual timing of responses. For these and other rea- 
sons, such tests are not adapted to group administration. Another characteristic 
of the Binet type of test is that it requires a highly trained examiner. Such tests are 
essentially clinical instruments, suited to the intensive study of individual cases. 
Group testing like the first Binet scale, was developed to meet a pressing practi- 
cal need. When the United States entered World War I in 1917, a committee was 
appointed by the American Psychological Association to consider ways in which 


?E L. Goodenough (1949, pp. 50-51) noted that in 1887, 21 years before the appearance of the 1908 Binet- 
Simon Scale, S. E. Chaille published in the New Orleans Medical and Surgical Journal a series of tests for infants 
arranged according to the age at which the tests are commonly passed. Partly because of the limited circulation 
of the journal and partly, perhaps, because the scientific community was not ready for it, the significance of this 
age-scale concept passed unnoticed at the time. Binet's own scale was influenced by the work of some of his con- 
temporaries, notably Blin and Damaye, who prepared a set of oral questions from which they derived a single 
global score for each child (T. H. Wolf, 1973). 
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psychology might assist in the conduct of the war. This committee, under the di- 
rection of Robert M. Yerkes, recognized the need for the rapid classification of the 
million and a half recruits with respect to general intellectual level. Such infor- 
mation was relevant to many administrative decisions, including rejection or dis- 
charge from military service, assignment to different types of service, or admission 
to officer-training camps. It was in this setting that the first group intelligence test 
was developed. In this task, the army psychologists drew on all available rest ma- 
terials, and especially on an unpublished group intelligence test prepared by 
Arthur S. Otis, which he turned over to the army. A major contribution of Otis's 
test, which he designed while a student in one of Terman's graduate courses, was 
the introduction of multiple-choice and other “objective” item types. 

The tests finally developed by the army psychologists came to be known as the 
Army Alpha and the Army Beta. The former was designed for general routine 
testing; the latter was a nonlanguage scale employed with illiterates and with 
foreign-born recruits who were unable to take a test in English. Both tests were 
suitable for administration to large groups. 

Shortly after the end of World War I, the army tests were released for civilian 
use. Not only did the Army Alpha and Army Beta themselves pass through many 
revisions, but they also served as models for most group intelligence tests. The 
testing movement underwent a tremendous spurt of growth. Soon group intelli- 
gence tests were being devised for all ages and types of persons, from preschool 
children to graduate students. Large-scale testing programs, previously impossible, 
were now being launched with zestful optimism. Because group tests were de- 
signed as mass testing instruments, they not only permitted the simultaneous ex- 
amination of large groups but they also used simplified instructions and 
administration procedures, thereby requiring a minimum of training on the part 
of the examiner. Schoolteachers began to give intelligence tests to their classes. 
College students were routinely examined prior to admission. Extensive studies of 
special adult groups, such as prisoners, were undertaken. And soon the general 
public became IQ-conscious. 

The application of such group intelligence tests far outran their technical im- 
provement. That the tests were still technically crude was often forgotten in the 
rush of gathering scores and drawing practical conclusions from the results. When 
the tests failed to meet unwarranted expectations, skepticism and hostility toward 
all testing often resulted. Thus, the testing boom of the 1920s, based on the indis- 
criminate use of tests, may have done as much to retard as to advance the progress 
of psychological testing. 


APTITUDE TESTING 


Although intelligence tests were originally designed to sample a wide variety of 
functions in order to estimate the individual’s general intellectual level, it soon 
became apparent that such tests were quite limited in their coverage. Not all im- 
portant functions were represented. In fact, most intelligence tests were primarily 
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measures of verbal ability and, to a lesser extent, of the ability to handle numeri- 
cal and other abstract and symbolic relations. Gradually, psychologists came to 
recognize that the term "intelligence test” was a misnomer, since only certain as- 
pects of intelligence were measured by such tests. 

To be sure, the tests covered abilities that are of prime importance in the cul- 
ture for which they were designed. But it was realized that more precise designa- 
tions, in terms of the type of information these tests were able to yield, would be 
preferable. For example, a number of tests that would probably have been called 
intelligence tests during the 1920s later came to be known as scholastic aptitude 
tests. This shift in terminology was made in recognition of the fact that many so- 
called intelligence tests measure that combination of abilities demanded and fos- 
tered by academic work. 

Even prior to World War I, psychologists had begun to recognize the need for 
tests of special aptitudes to supplement the global intelligence tests: These special 
aptitude tests were developed particularly for use in vocational counseling and in 
the selection and classification of industrial and. military personnel. Among the 
most widely used are tests of mechanical, clerical, musical, and artistic aptitudes. 

The critical evaluation of intelligence tests that followed their widespread 
and indiscriminate use during the 1920s also revealed another noteworthy fact: 
An individual’s performance on different parts of such a test often showed 
marked variation. This was especially apparent on group tests, in which the 
items are commonly segregated into subtests of relatively homogeneous content. 
For example, a person might score relatively high on a verbal subtest and low on 
a numerical subtest, or vice versa. To some extent, such internal variability is 
also discernible on a test like the Stanford-Binet, in which, for example, all 
items involving words might prove difficult for a particular individual, whereas 
items employing pictures or geometric diagrams may place her or him at an 
advantage. 

Test users, and especially clinicians, frequently utilized such intercomparisons 
in order to obtain more insight into the individual’s psychological makeup. Thus, 
not only the IQ or other global score but also performance on item clusters or sub- 
tests would be examined in the evaluation of the individual case. Such a practice 
is not to be generally recommended, however, because intelligence tests were not 
designed for the purpose of differential aptitude analysis. Often the subtests being 
compared contain too few items to yield a stable or reliable estimate of a specific 
ability. As a result, the obtained difference between subtest scores might be re- 
versed if the individual were retested on a different day or with another form of 
the same test. If such intra-individual comparisons are to be made, tests are 
needed that are specially designed to reveal differences in performance in various 
functions. 

While the practical application of tests demonstrated the need for multiple ap- 
titude tests, a parallel development in the study of trait organization was gradually 
providing the means for constructing such tests. Statistical studies on the nature 
of intelligence had been exploring the interrelations among scores obtained by 
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by the English psychologist Charles Spearman (1904, 1927) during the first 
decade of the twentieth century. Subsequent methodological developments, 
based on the work of such American psychologists as T. L. Kelley (1928) and L. L. 
Thurstone (1938, 1947b), as well as on that of other American and English in- 
vestigators, have come to be known as factor analysis. 

The contributions that the methods of factor analysis have made to test con- 
struction will be more fully examined and illustrated in chapter 11. For the pre- 
sent, it will suffice to note that the data gathered by such procedures have 
indicated the presence of a number of relatively independent factors, or traits. 
Some of these traits were represented, in varying proportions, in the traditional 
intelligence tests. Verbal comprehension and numerical reasoning are examples 
of this type of trait. Others, such as spatial, perceptual, and mechanical aptitudes, 
were found more often in special aptitude tests than in intelligence tests. 

One of the chief practical outcomes of factor analysis was the development of 
multiple aptitude batteries. These batteries are designed to provide a measure of the 
individual's standing in each of a number of traits. In place of a total score or IQ, 
a separate score is obtained for such traits as verbal comprehension, numerical ap- 
titude, spatial visualization, arithmetic reasoning, and perceptual speed. Such bat- 
teries thus provide a suitable instrument for making the kind of intra-individual 
analysis, or differential diagnosis, that test users had been trying for many years to 
obtain, with crude and often erroneous results, from intelligence tests. These bat- 
teries also incorporate into a comprehensive and systematic testing program 
much of the information formerly obtained from special aptitude tests, since the 
multiple aptitude batteries cover some of the traits not ordinarily included in in- 
telligence tests. 

Multiple aptitude batteries represent a relatively late development in the test- 
ing field. Nearly all appeared since 1945. In this connection, the work of the mil- 
itary psychologists during World War II should be noted. Much of the test 
research conducted in the armed services was based on factor analysis and was di- 
rected toward the construction of multiple aptitude batteries. In the Air Force, for 
example, special batteries were constructed for pilots, bombardiers, radio opera- 
tors, range finders, and many other military specialists. A report of the batteries 
prepared in the Air Force alone occupies at least 9 of the 19 volumes devoted to 
the aviation psychology program during World War ll (Army Air Forces, 
1947-1948). Research along these lines is still in progress under the sponsorship 
of various branches of the armed services. A number of multiple aptitude batteries 
have likewise been developed for civilian use and are widely applied in educa- 
tional and vocational counseling and in personnel selection and classification. 
Examples of such batteries will be discussed in chapters 10 and 17. 

A more recent development, emerging in the late 1980s and early 1990s, pro- 
vides a fundamental integration of two previously contrasting approaches to men- 
tal measurement, represented by traditional intelligence tests and by multiple 
aptitude batteries (Anastasi, 1994). It is coming to be recognized that human abil- 
ity can be properly assessed at different levels of breadth, from the narrowly defined 
aptitudes of specific tests (or even individual items), through increasingly broader 
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trait levels, to an overall score such as the traditional IQ. For different. testing pur- 
poses, a different level of breadth is most appropriate. Accordingly, recently devel- 
oped intelligence tests, such as the Differential Ability Scales, or recent revisions 


chapter 8), combine comprehensive coverage of multiple aptitudes with flexible 
multilevel scoring for specific testing purposes. Although both of these examples 
are individually administered intelligence tests, the same comprehensive and flex- 
ible approach to test construction and use is having an impact on group-adminis- 
tered batteries, such as those discussed in chapter 10. The theoretical basis and 
practical implications of this ability-testing merger are discussed in chapter 11, in 
connection with current developments regarding the nature of intelligence. 


STANDARDIZED ACHIEVEMENT TESTS 


(O. W. Caldwell & Courtis, 1923; Ebel & Damrin, 1960). An important step in 
this direction was taken by the Boston public schools in 1845, when written ex- 


aminers. Among the arguments offered at the time in support of this innovation 
were that written examinations put all students in a uniform situation, permitted 
a wider coverage of content, reduced the element of chance in question choice, 
and eliminated the possibility of favoritism on the examiner's part. All these argu- 
ments have a familiar ring: They were used much later to justify the replacement 
of essay questions by objective multiple-choice items. 

After the turn of the century, the first standardized tests for measuring the out- 
comes of school instruction began to appear. Spearheaded by the work of E. L. 
Thorndike, these tests utilized measurement principles developed in the psycho- 
logical laboratory. Examples include scales for rating the quality of handwriting 
and written compositions, as well as tests in spelling, arithmetic computation, 
and arithmetic reasoning. Still later came the achievement batteries, initiated by 
the publication of the first edition of the Stanford Achievement Test in 1923. Its 
authors were three early leaders in test development: Truman L. Kelley, Giles M. 
Ruch, and Lewis M. Terman. Foreshadowing many characteristics of modern test- 
ing, this battery provided comparable measures of performance in different school 
subjects, evaluated in terms of a single normative group. 

At the same time, evidence was accumulating regarding the lack of agreement 
among teachers in grading essay tests. By 1930, it was widely recognized that essay 
tests not only were more time consuming for examiners and examinees but also 
yielded less reliable results than the "new type” of objective items.4 As the latter 


"Research beating on the relative effectiveness of essay and objective item types will be cited in chapter 17 in 
connection with the educational use of tests. 
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came into increasing use in standardized achievement tests, there was a growing 
emphasis on the design of items to test the understanding and application of 
knowledge and other broad educational objectives: The decade of the 1930s also 
witnessed the introduction of test-scoring machines, for which the new objective 
tests could be readily adapted. 

The establishment of statewide, regional, and national testing programs was 
another noteworthy parallel development. Probably the best known of these 
programs is thát of the College Entrance Examination Board (CEEB). Estab- 
lished at the turn of the century to reduce duplication in the examining of en- 
tering college freshmen, this program has undergone profound changes in its 
testing procedures and in the number and nature of participating colleges— 
changes that reflect intervening developments in both testing and education. 
In 1947, the testing functions of the CEEB were merged with those of the 
Carnegie Corporation and the American Council on Education to form Educa- 
tional Testing Service (ETS). In subsequent years, ETS has assumed responsi- 
bility for à growing number of testing programs on behalf of universities, 
professional schools, government agencies, and other institutions.. Mention 
should also be made of the American College Testing Program established in 
1959 to screen applicants to colleges not included in the CEEB program, and of 
several national testing programs for the selection of highly talented students 
for scholarship awards: 

Achievement tests are used not only for educational purposes but also in the 
selection of applicants for industrial and government jobs. Mention has already 
been made of the systematic use of civil service examinations in the Chinese em- 
pire, dating from about 150 B.C. (Bowman, 1989). In modern times, selection of 
government employees by examination was introduced in European countries in 
the late eighteenth and early nineteenth centuries. The U.S. Civil Service Com- 
mission’ installed competitive. examinations as a regular procedure in 1883 
(Kavruck, 1956). Test construction techniques developed during and prior to 
World War I were introduced into the examination program of the U.S. Civil 
Service with the appointment of L. J. O'Rourke as director of the newly estab- 
lished research division in 1922. Today this work is conducted by a large and 
technically sophisticated research staff within the unit designated as the U.S. Of- 
fice of Personnel Management. 

As more and more psychologists trained in psychometrics participated in the 
construction of standardized achievement tests, the technical aspects of achieve- 
ment tests increasingly came to resemble those of intelligence and aptitude tests. 
Procedures for constructing and evaluating all these tests have much in common. 
The increasing efforts to prepare achievement tests that. would measure the at- 
tainment of broad educational goals, as contrasted to the recall of factual minu- 
tiae, also made the content of achievement tests resemble more closely that of 
intelligence tests. Today the difference between these two types of tests, is chiefly 
one of degree of specificity of content and extent to which the test presupposes a 
designated course of prior instruction. 
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ASSESSMENT OF PERSONALITY 


Another area of psychological testing is concerned with the affective or nonintel- 
lectual aspects of behavior, to be discussed in chapters 13—16. Tests designed for 
this purpose are commonly known as personality tests, although many psycholo- 
gists prefer to use the term "personality" in a broader sense, to refer to the entire 
individual. Intellectual as well as nonintellectual traits would thus be included 
under this heading. In the terminology of psychological testing, however, the des- 
ignation "personality test" most often refers to measures of such characteristics as 
emotional states, interpersonal relations, motivation, interests, and attitudes. 

An early precursor of personality testing is illustrated by Kraepelin's use of the 
free association test with psychiatric patients. In this test, the examinee is given 
specially selected stimulus words and is required to respond to each with the first 
word that comes to mind. Kraepelin (1892) also employed this technique to study 
the psychological effects of fatigue, hunger, and drugs; he concluded that all these 
agents increase the relative frequency of superficial associations. Sommer (1894), 
also writing during the last decade of the nineteenth century, suggested that the 
free association test might be used to differentiate between the various forms of 
mental disorder. The free association technique has subsequently been utilized for 
a variety of testing purposes and is still currently employed. Mention should also 
be made of the work of Galton, Pearson, and Cattell in the development of stan- 
dardized questionnaire and rating-scale techniques. Although originally devised 
for other purposes, these procedures were eventually employed by others in con- 
structing some of the most common types of current personality tests. 

The prototype of the personality questionnaire, or self-report inventory (chap. 
13), is the Personal Data Sheet developed by Woodworth during World War | 
(DuBois, 1970; Franz, 1919, pp. 171-176; L. R. Goldberg, 1971; Symonds, 1931, 
chap. 5). This test was designed as a rough screening device for identifying seri- 
ously disturbed men who would be disqualified for military service. The inventory 
consisted of a number of questions dealing with common symptoms of psy- 
chopathology, which respondents answered about themselves. A total score was 
obtained by counting the number of symptoms reported. The development of the 
Personal Data Sheet was not completed early enough to permit its operational use 
before the war ended. Immediately after the war, however, civilian forms were 
prepared, including a special form for use with children. The Woodworth Per- 
sonal Data Sheet, moreover, served as a model for most subsequent emotional ad- 
justment inventories. In some of these questionnaizes, an attempt was made to 
subdivide emotional adjustment into more specific forms, such as home adjust- 
ment, school adjustment, and vocational adjustment. Other tests concentrated 
more intensively on a narrower area of behavior or were concerned with more dis- 
tinctly social responses, such as dominance-submission in interpersonal contacts. 
A later development was the construction of tests for quantifying the expression 
of attitudes and interests (chap. 14). These tests, too, were based essentially on 
questionnaire techniques. 

Another approach to the measurement of personality is through the applica- 
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tion of performance or situational tests (chap. 16). In such tests, the examinee has a 
task to perform whose purpose is often disguised. Most of these tests simulate 
everyday-life situations quite closely. The first extensive application of such tech- 
niques is to be found in the tests developed in the late 1920s and early 1930s by 
Hartshorne, May, and their associates (1928, 1929, 1930). This series, standard- 
ized on schoolchildren, was concerned with such behavior as cheating, lying, 
stealing, cooperativeness, and persistence. Objective, quantitative scores could be 
obtained on each of a large number of specific tests. Another illustration, for the 
adult level, is provided by the series of situational tests developed during World 
War II in the Assessment Program of the Office of Strategic Services (OSS, 
1948). These tests were concerned with relatively complex and subtle social and 
emotional behavior and required rather elaborate facilities and trained personnel 
for their administration. The interpretation of the individual’s responses, more- 
over, was relatively subjective. 

Projective techniques (chap. 15) represent a third approach to the study of per- 
sonality and one that has shown phenomenal growth, especially among clini- 
cians. In such tests, the client is given a relatively unstructured task that permits 
wide latitude in its solution. The assumption underlying such methods is that the 
individual will project her or his characteristic modes of response into such a task. 
Like the performance and situational tests, projective techniques are more or less 
disguised in their purpose, thereby reducing the chances that the respondent can 
deliberately create a desired impression. The previously cited free association test 
represents one of the earliest types of projective techniques. Sentence-completion 
tests have also been used in this manner. Other tasks commonly employed in pro- 
jective techniques include drawing, arranging toys to create a scene, extempora- 
neous dramatic play, and interpreting pictures or inkblots. 

All available types of personality tests present certain difficulties, both practical 
and theoretical. Each approach has its own special advantages and disadvantages. 
On the whole, personality testing has lagged behind ability testing in its practical 
accomplishments. But such lack of progress cannot be attributed to insufficient ef- 
fort. Research on the measurement of personality has attained impressive propor- 
tions since 1950, and many ingenious devices and technical improvements are 
under investigation. It is rather the special difficulties encountered in the measure- 
ment of personality that account for the slow advances in this area. 

Two significant unifying trends are emerging from current research with person- 
ality tests (see Anastasi, 1985b, 1992a, 1993; Digman, 1990; L. R. Goldberg, 
1993; Simon, 1994). First, there is increasing evidence of mutual influence be- 
tween affective (“personality”) and cognitive (“ability”) traits, in both task perfor- 
mance and behavior development. The traditional distinction between the two 
types of traits is coming to be recognized as artificially imposed for convenience in 
describing and measuring different aspects of behavior. Second, theoretical analy- 
ses of the nature and composition of personality support the reintegration of cog- 
nitive and affective traits into a comprehensive model of human activity that 
subsumes all forms of behavior. This comprehensive model relates to basic re- 
search on both intellectual traits (chap. 11) and affective traits (chap. 13). 
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art Two, which consists of chapters 3 to 7, provides an introduction to basic 
concepts and methodology needed for an understanding of psychological 
tests and for the proper interpretation of test results. The successive chap- 
ters deal with norms, reliability, validity, and item analysis and test design. 
The present chapter is concerned with the development and use of norms and 
with other procedures that facilitate the test user's interpretation of test scores. In 
the absence of additional interpretive data, a raw score on any psychological test 
is meaningless. To say that an individual has correctly solved 15 problems on a 
mathematical reasoning test, or identified 34 words in a vocabulary test, or suc- 
cessfully assembled a mechanical object in 57 seconds conveys little or no infor- 
mation about her or his standing in any of these fünctions. Nor do the familiar 
percentage scores provide a satisfactory solution to the problem of interpreting 
test scores. A score of 6596 correct on one vocabulary test, for example, might be 
equivalent to 3096 correct on another, and to 8096 correct on a third. The diffi- 
culty level of the items making up each test will, of. course, determine the mean- 
ing of the score. Like all raw Scores, percentage scores can be interpreted only in 
terms of a clearly defined and uniform frame of reference. 


62 


Norms and the Meaning of Test Scores | 63 


In order to ascertain more precisely the individual's exact position with refer- 
ence to the standardization sample, the raw score is converted into some relative 
measure. These derived scores are designed to serve a dual purpose. First, they in- 
dicate the individual's relative standing in the normative sample and thus permit 
an evaluation of her or his performance in reference to other persons. Second, 
they provide comparable measures that permit a direct comparison of the individ- 
ual’s performance on different tests. For example, if a girl has a raw score of 40 on 
a vocabulary test and a raw score of 22 on an arithmetic reasoning test, we obvi- 
ously know nothing about her relative performance on the two tests. Is she better 
in vocabulary or in arithmetic, or equally good in both? Since raw scores on dif- 
ferent tests are usually expressed in different units, a direct comparison of such 
scores is impossible. The difficulty level of the particular test would also affect 
such a comparison between raw scores. Derived scores, on the other hand, can be 
expressed in the same units and referred to the same or to closely similar norma- 
tive samples for different tests. The individual's relative performance in many dif- 
ferent functions can thus be compared. 

There are various ways in which raw scores may be converted to fulfill the two 
objectives stated above. Fundamentally, however, derived scores are expressed in 
one of two major ways: (1) developmental level attained, or (2) relative position 
within a specified group. Both these types of scores and some of their common 
variants are considered in separate sections of this chapter. But first it will be nec- 
essary to examine a few elementary statistical concepts that underlie the develop- 
ment and utilization of norms. The following section is included simply to clarify 
the meaning of certain common statistical measures. Simplified computational 
examples are given only for this purpose and not to provide training in statistical 
methods. For computational details and specific procedures to be followed in the 
practical application of these techniques, the reader is referred to any recent text- 
book on psychological statistics (e.g., D.C. Howell, 1997; Runyon & Haber, 
1991; West, 1991). There is growing recognition of the need for basic knowledge 
of statistical methodology, not only for test users but also for anyone who wants to 
read with understanding the published research reports in any area of psychology 
(L. S. Aiken, West, Sechrest, & Reno, 1990; Anastasi, 1991; Lambert, 1991; 
S. T. Meier, 1993). 


STATISTICAL CONCEPTS 


A major object of statistical method is to organize and summarize quantitative 
data in order to facilitate their understanding. A list of 1,000 test scores can be 
an overwhelming sight. In that form, it conveys little meaning. A first step in 
bringing order into such a chaos of raw data is to tabulate the scores into a fre- 
quency distribution, as illustrated in Table 3-1. Such a distribution is prepared by 
grouping the scores into convenient class intervals and tallying each score in the 
appropriate interval. When all scores have been entered, the tallies are counted 
to find the frequency, or number of cases, in each class interval. The sums of these 
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Frequency Distribution of Scores of 1,000 College Students on 
a Code-Learning Test 


Class Interval Frequency 

52-55 1 

48-51 1 

44-47 20 

40-43 3 

36-39 156 

i 32-35 328 

28-31 244 

24-21 136 

i _ 20-23 28 
16-19 8 

12-15 3 

n kV 8-11 : 2 
N= 1,000 


(Data from Anastasi, 1934, p. 34.) 


frequencies will equal N, the total number of cases in the group. Table 3-1 shows 
the scores of 1,000 college students in a code-learning test in which one set of ar- 
tificial words, or nonsense syllables, was to be substituted for another. The raw 
scores, giving number of correct syllables substituted during a two-minute trial, 
ranged from 8 to 52. They have been grouped into class intervals of 4 points, from 
52-55 at the top of the distribution down to 8-11. The frequency column reveals 
that two persons scored between 8 and 11, three ‘between 12 and 15, eight be- 
tween 16 and 19, and so on. 

The information provided by a frequency distribution can also be presented 
graphically in the form of a distribution curve. Figure 3-1 shows the data of Table 
3-1 in graphic form. On the baseline, or horizontal axis, are the scores grouped 
into class intervals; on the vertical axis are the frequencies, or number of cases 
falling within each class interval. The graph has been plotted in two ways, both 
forms being in common use. In the histogram, the height of the column erected 
over each class interval corresponds to the number of persons scoring in that in- 
terval. We can think of each individual standing on another's shoulders to form 
the column. In the frequency polygon, the number of persons in each interval is in- 
dicated by a point in the center of the class interval and across from the appropri- 
ate frequency. The successive points are then joined by straight lines. 

Except for minor irregularities, the distribution shown in Figure 3-1 resembles 
the bell-shaped normal curve. A mathematically determined perfect normal curve 
appears in Figure 3-2. This type of curve has important mathematical properties 
and provides the basis for many kinds of statistical analyses. For the present pur- 
pose, however, only a few features will be noted. Essentially, the curve indicates 
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Number of Cases 


36- 40- . 
39. 43. 


S 


Figure 3-1. Distribution Curves: Frequency Polygon and Histogram. 
(Data from Table 3-1.) 


that the largest number of cases cluster in the center of the range and that the 
number drops off gradually in. both directions as the extremes are approached. 
The curve is bilaterally symmetrical, with a single peak in the center. Most distri- 
butions of human traits, from height and weight to aptitudes and personality 
characteristics, approximate the normal curve. In general, the larger the group, 
the more closely will the distribution resemble the theoretical normal curve. 

A group of scores can also be described in terms of some measure of central ten- 
dency. Such a measure provides a single, most typical or representative score to 
characterize the performance of the entire group. The most familiar of these mea- 
sures is the average, more technically known as the mean (M), and it is found by 
adding all scores and dividing the sum by the number of cases (N). Another mea- 
sure of central tendency is the mode, or most frequent score. In a frequency distri- 
bution, the mode is the midpoint of the class interval with the highest frequency. 
Thus, in Table 3-1, the mode falls midway between 32 and 35, being 33.5. It will 
be noted that this score corresponds to the highest point on the distribution curve 
in Figure 3-1. A third measure of central tendency is the median, or middlemost 
score when all scores have been arranged in order of size. The median is the point 
that bisects the distribution, half the cases falling above it and half below. 
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Figure 3-2. Frequency Distributions with the Same Mean but Different 
Variability. 


Further description of a set of test scores is given by measures of variability, or the 
extent of individual differences around the central tendency. The most obvious 
and familiar way for reporting variability is in terms of the range between the high- 
est and lowest score. The range, however, is extremely crude and unstable, for it is 
determined by only two scores. A single unusually high or low score would thus 
markedly affect its size. A more precise method of measuring variability is based on 
the difference between each individual's score and the mean of the group. 

At this point, it will be helpful to look at the example in Table 3-2, in which 
the various measures under consideration have been computed on 10 cases. Such 
a small group was chosen in order to simplify the demonstration, although in ac- 
tual practice we would usually be dealing with larger groups. Table 3-2 serves also 
to introduce certain standard statistical symbols that should be noted for future 
reference. Original raw scores are conventionally designated by a capital X, and a 
small x is used to refer to deviations of each score from the group mean. The sym- 
bol X means "sum of.” It will be seen that the first column in Table 3-2 gives the 
data for the computation of mean and median. The mean is 40; the median is 
40.5, falling midway between 40 and 41—five cases (50%) are above the median 
and five below. There is little point in finding a mode in such a small group, since 
the cases do not show clear-cut clustering on any one score. Technically, however, 
41 would represent the mode, because two persons obtained this score, while all 
other scores occur only once. 

The second column shows how far each score deviates above or below the 
mean of 40. The sum of these deviations will always equal zero, because the posi- 
tive and negative deviations around the mean necessarily balance, or cancel each 
other out (+ 20 — 20 = 0). If we ignored signs, of course, we could average the 
absolute deviations, thus obtaining a measure of the average amount by which each 
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Illustration of Central Tendency and Variability 


Score (X) Deviation (x = X — M) Dev. Squared (x?) 
48 48 64 
4 +7 
ai 4 3e 
4l + 
Median = 40.5 ——» S 2 
38 -2 
Se 36 -4}-20 
cases 34 S -6 
32 -8, 
XX- 400 
EE S A 
dan MeL 


ta spares] EE i is 


Note. The symbols X and o given in this table are the capital and lowercase of the same Greek letter, pro- 
nounced "sigma." In many statistical writings, SD (or simply S) refers to the standard deviation of the sample 
on which the data were actually obtained, while o refers to the estimated value of the standard deviation in the 
population from which the sample was drawn. 


person deviates from the group mean. Although of some descriptive value; such 
an "average deviation" is not suitable for use in further mathematical analyses be- 
cause of the arbitrary discarding of signs and is therefore not employed in practice. 

A much more serviceable measure of variability is the standard deviation (sym- 
bolized by either SD or 6), in which the negative signs are legitimately eliminated 
by squaring each deviation. This procedure has been followed in the last column 


of Table 3-2. The sum of this column divided by the number of cases! (z2 


is known as the variance, or mean square deviation. The variance has proved 
extremely useful in sorting out the contributions of different factors to individual 
differences in test performance. For the present purposes, however, our chief 
concern is with the SD, which is the square root of the variance, as shown in 
Table 3-2. This measure is commonly employed in comparing the variability of 
different groups. In Figure 3-2, for example, are two distributions having the 
same mean but differing in variability. The distribution with wider individual 


'The computations illustrated in this chapter deal with descriptive statistics, which refers to the sample actually 
measured; in inferential statistics, the N is replaced by N-1 in order to estimate corresponding population values 
from sample data. The smaller the sample, the greater will be the difference between sample and population val- 
ues. For explanation, see any current statistics text (e.g., Comrey & Lee, 1992). 
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Figure 3-3. Percentage Distribution of Cases in a Normal Curve. 


difference yields a larger SD than the one with narrower individual differences. 
When assessing the relative performance of two groups, we should compare both 
SDs and means. If the two groups differ in variability, it may indicate group differ- 
ences in the proportion of high scorers, low scorers, or both, regardless of mean 
differences. Comprehensive procedures for combining the effects of mean differ- 
ences and SD differences are available (see, e.g., Feingold, 1995). 

The SD also provides the basis for expressing an individual's scores on different 
tests in terms of norms, as will be shown in the section on standard scores. The in- 
terpretation of the SD is especially ciear-cut when applied to a normal or approx- 
imately normal distribution curve. In such a distribution, there is an exact 
relationship between the SD and the Proportion of cases, as shown in Figure 3-3. 
On the baseline of this normal curve have been marked distances representing 
one, two, and three standard deviations above and below the mean. For instance, 
in the example given in Table 3-2, the mean would correspond to a score of 40; 
+16 would correspond to 44.9 (i.e., 40 -- 4.9); +20, to 49.8 (i.e., 40-- 2 x 4.9); and 
so on. The percentage of cases that fall between the mean and Ale in a normal 
curve is 34.13. Because the curve is symmetrical, 34.1396 of the cases are likewise 
found between the mean and —10, so that between +10 and Bo on both sides of 
the mean there are 68.2696 of the cases. Nearly all the cases (99.7296) fall within 
130 from the mean. These relationships are particularly relevant in the interpre- 
tation of standard scores and percentiles, discussed in later sections. 


DEVELOPMENTAL NORMS 


One way in which meaning can be attached to test scores is to indicate how far 
along the normal developmental path the individual has progressed. Thus an 

-year-old who performs as well as the average 10-year-old on an intelligence test 
may be described as having a mental age of 10; a mentally retarded adult who per- 
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forms at the same level would likewise be assigned an MA of 10. In a different 
context, a fourth-grade child may be characterized as reaching the sixth-grade 
norm in a reading test and the third-grade norm in an arithmetic test. Other de- 
velopmental systems utilize more highly qualitative descriptions of behavior in 
specific functions, such as sensorimotor activities or concept formation. However 
expressed, scores based on developmental norms tend to be psychometrically 
crude and do not lend themselves well to precise statistical treatment. Neverthe- 
less, they have considerable appeal for descriptive purposes, especially in the in- 
tensive clinical study of individuals and for certain research purposes. 


Mental Age. In chapter 2 it was noted that the term “mental age” was widely 
popularized through the various translations and adaptations of the Binet-Simon 
scales, although. Binet himself had employed the more neutral term “mental 
level." In age scales such as the Binet and its revisions (prior to 1986), items were 
grouped into year levels. For. example, those items passed by the. majority of 
-year-olds in the standardization sample were placed in the 7-year level, those 
passed by the majority of 8-year-olds were assigned to the 8-year level, and so 
forth. A child’s score on the test would then correspond to the highest year level 
that he or she could successfully complete. In actual practice, the individual’s 
performance showed a certain amount of scatter. In other words, examinees failed 
some tests below their mental age and passed some above it. For this reason, it was 
customary to compute the basal age, that is, the highest age at and below which all 
tests were passed. Partial credits, in months, were then added to this basal age for 
all tests passed at higher year levels. The child’s mental age on. the test was the sum 
of the basal age and the additional months of credit earned at higher age levels. 

Mental age norms have also been employed with tests that are not divided into 
year levels. In such a case, the child's raw score is first determined. Such a score 
may be the total number of correct items on the whole test; or it may be based on 
time, on number of errors, or on some combination of such measures. The mean 
raw scores obtained by the children in each year group within the standardization 
sample constitute the age norms for such a test. The mean raw score of the 8-year- 
old children, for example, would represent the 8-year norm. If an individual’s raw 
score is equal to the mean 8-year-old raw score, then her or his mental age on the 
test is 8 years. All raw scores on such a test can be transformed in a similar man- 
ner by reference to the age norms. 

It should be noted that the mental age unit does not remain constant with age 
but tends to shrink with advancing years. For example, a child who is one year re- 
tarded at age 4 will be approximately three years retarded at age 12. One year of 
mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 
to 12. Since intellectual development progresses more rapidly at the earlier ages 
and gradually decreases as the individual approaches her or his mature limit, the 
mental age unit shrinks correspondingly with age. This relationship may be more 
readily visualized if we think of the individual's height as being expressed in terms 
of "height age." The difference, in inches, between a height age of 3 and 4 years 
would be greater than that between a height age of 10 and 11. Owing to the 
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progressive shrinkage of the MA unit, one year of acceleration or retardation at, 
let us say, age 5 represents a larger deviation from the norm than does one year of 
acceleration or retardation at age 10. 


Grade Equivalents. Scores on educational achievement tests are often inter- 
preted in terms of grade equivalents. This practice is understandable because the 
tests are employed within a school setting. To describe a pupil's achievement as 
equivalent to seventh-grade performance in spelling, eighth-grade in reading, and 
fifth-grade in arithmetic has the same popular appeal as the use of mental age in 
the traditional intelligence tests. 

Grade norms are found by computing the mean raw score obtained by children 
in each grade. Thus, if the average number of problems solved correctly on an 
arithmetic test by the fourth-graders in the standardization sample is 23, then a 
raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equiva- 
lents, representing fractions of a grade, are usually found by interpolation, al- 
though they can also be obtained directly by testing children at different times 
within the school year. Because the school year covers ten months, successive 
months can be expressed as decimals. For example, 4.0 refers to average perfor- 
mance at the beginning of the fourth grade (September testing), 4.5 refers to av- 
erage performance at the middle of the grade (February testing), and so forth. 

Despite their popularity, grade norms have several shortcomings. First, the 
content of instruction varies somewhat from grade to grade. Hence, grade norms 
are appropriate only for common subjects taught throughout the grade levels cov- 
ered by the test. They are not generally applicable at the high school level, where 
many subjects may be studied for only one or two years. Even with subjects taught 
in each grade, however, the emphasis placed on different subjects may vary from 
grade to grade, and progress may therefore be more rapid in one subject than in 
another during a particular grade. In other words, grade units are obviously un- 
equal and these inequalities occur irregularly in different subject-matter areas. 

Grade norms are also subject to misinterpretation unless the test user keeps 
firmly in mind the manner in which they were derived. For example, if a fourth- 
grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean that 
she mastered the arithmetic processes taught in the sixth grade. She undoubtedly 
obtained her score largely by superior performance in fourth-grade arithmetic. It 
certainly could not be assumed that she has the prerequisites for seventh-grade 
arithmetic. Finally, grade norms tend to be incorrectly regarded as performance 
standards. A sixth-grade teacher, for example, may assume that all pupils in her 
class should fall at or close to the sixth-grade norm in achievement tests. This 
misconception is certainly not surprising when grade norms are used. Yet individ- 
ual differences within any one grade are such that the range of achievement test 
scores will inevitably extend over several grades. 


Ordinal Scales. Another approach to developmental norms derives from 
research in child psychology. Empirical observation of behavior development 
in infants and young children led to the description of behavior typical of suc- 
cessive ages in such functions as locomotion, sensory discrimination, linguistic 
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communication, and concept formation. An early example is provided by the 
work of Gesell and his associates at Yale (Ames, 1937; Gesell & Amatruda, 1947; 
Halverson, 1933; Knobloch & Pasamanick, 1974). The Gesell Developmental 
Schedules show the approximate developmental level in months that the child 
has attained in each of four major areas of behavior, namely, motor, adaptive, 
language, and personal-social. These levels are found by comparing the child’s 
behavior with that typical of eight key ages, ranging from 4 weeks to 36 months. 

Gesell and his co-workers emphasized the sequential patterning of early behav- 
ior development. They cited extensive evidence of uniformities of developmental 
sequences and an orderly progression of behavior changes. For example, the 
child’s reactions toward a small object placed in front of her or him exhibit a 
characteristic chronological sequence in visual fixation and in hand and finger 
movements. Use of the entire hand in crude attempts at palmar prehension oc- 
curs at an earlier age than use of the thumb in opposition to the palm; this type of 
prehension is in turn followed by the use of the thumb and index finger in a more 
efficient pincerlike grasp of the object. Such sequential patterning was likewise 
observed in walking, stair climbing, and most of the sensorimotor development of 
the first few years. The scales developed within this framework are ordinal in the 
sense that developmental stages follow in a constant order, each stage presuppos- 
ing mastery of prerequisite behavior characteristic of earlier stages. 

In the 1960s, there was a sharp upsurge of interest in the developmental theo- 
ries of the Swiss child psychologist Jean Piaget (see Flavell, 1963; Ginsburg & 
Opper, 1969; D. R. Green, Ford, & Flamer, 1971). Piaget's research focused on 
the development of cognitive processes from infancy to the midteens. He was 
concerned with specific concepts rather than broad abilities. An example of such 
a concept, or schema, is object permanence, whereby the child is aware of the 
identity and continuing existence of objects when they are seen from different 
angles or are out of sight. Another widely studied concept is conservation, or 
the recognition that an attribute remains constant over changes in perceptual 
appearance, as when the same quantity of liquid is poured into differently shaped 
containers, or when rods of the same length are placed in different spatial 
arrangements. 

Piagetian tasks have been used widely in research by developmental psycholo- 
gists, and some have been organized into standardized scales, to be discussed in 
chapter 9 (Goldschmid & Bentler, 1968b; Pinard & Laurendeau, 1964; Uigiris & 
Hunt, 1975). In accordance with Piaget's approach, these instruments are ordinal 
scales, in which the attainment of one stage is contingent upon completion of the 
earlier stages in the development of the concept. The tasks are designed to reveal 
the dominant aspects of each developmental stage; only later are empirical data 
gathered regarding the ages at which each stage is typically reached. In this 


?This usage of the term “ordinal scale" differs from that in statistics, in which an ordinal scale is simply one that 
permits a rank-ordering of individuals without knowledge about amount of difference between them; in the sta- 
tistical sense, ordinal scales are contrasted to equal-unit interval scales. Ordinal scales of child development are 
actually designed on the model of a Guttman scale or simplex, in which successful performance at one level im- 
plies success at all lower levels (L. Guttman, 1944). An extension of Guttman’s analysis to include nonlinear hi- 
erarchies is described by Bart and Airasian (1974), with special reference to Piagetian scales. 
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respect, the procedure differs from that followed in constructing age scales, in 
which items are selected in the first place on the basis of their differentiating be- 
tween successive ages. Although interest in the contributions of the Piagetian ap- 
proach continues, critical analyses and empirical evaluations have highlighted 
both its constructive features and its limitations (Sugarman, 1987).3 

In summary, ordinal scales are designed to identify the stage reached by the 
child in the development of specific behavior functions. Although scores may be 
reported in terms of approximate age levels, such scores are secondary to a quali- 


D 


ally able to do (e.g., climbs stairs without assistance; recognizes identity in quan- 
tity of liquid when poured into differently shaped containers), they share 
important features with the domain-referenced tests discussed in a later section of 


WITHIN-GROUP NORMS 


grade. Within-group scores have a uniform and clearly defined quantitative mean- 
ing and can be appropriately employed in most types of statistical analyses. 


arithmetic reasoning test, then a raw score of 15 corresponds to the 28th 
percentile (P35). A percentile indicates the individual's relative position in the 
standardization sample. Percentiles can also be regarded as ranks in a group of 
100, except that in ranking it is customary to start counting at the top, the best 
person in the group receiving a rank of one. With percentiles, on the other hand, 
we begin counting at the bottom, so that the lower the Percentile, the poorer the 
individual's standing. 

The 50th percentile (Pso) corresponds to the median, already discussed asa 
measure of central tendency. Percentiles above 50 represent above-average per- 
formance; those below 50 signify inferior performance. The 25th and 75th per- 
centiles are known as the first and third quartile points (Qi and Q3), because they 


provide convenient landmarks for describing a distribution of scores and compar- 
ing it with other distributions. 


Eeer 
?For further evaluation of the Piagetian approach, see chapter 9. 
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Percentiles should not be confused with the familiar percentage scores. The 
latter are raw scores, expressed in terms of the percentage of correct items; per- 
centiles are derived scores, expressed in terms of percentage of persons. A raw 
score lower than any obtained in the standardization sample would have a per- 
centile rank of zero (Po); one higher than any score in the standardization sample 
would have a percentile rank of 100 (P100). These percentiles, however, do not 
imply a zero raw score and a perfect raw score. 

Percentile scores have several advantages. They are easy to compute and can 
be readily understood, even by technically untrained persons. Moreover, per- 
centiles are universally applicable. They can be used equally well with adults and 
children and are suitable for any type of test, whether it measures aptitude or per- 
sonality variables. 

The chief drawback of percentile scores arises from the marked inequality 
of their units, especially at the extremes of the distribution. If the distribution of 
raw scores approximates the normal curve, as is true of most test scores, then raw 
score differences near the median or center of the distribution are exaggerated in 
the percentile transformation, whereas raw score differences near the ends of the 
distribution are greatly shrunk. This distortion of distances between scores can be 
seen in Figure 3-4. In a normal curve, it will be recalled, cases cluster closely at the 
center and scatter more widely as the extremes are approached. Consequently, 
any given percentage of cases near the center covers a shorter distance on the 
baseline than does the same percentage near the ends of the distribution. In Fig- 
ure 3-4, this discrepancy in the gaps between percentile ranks (PR) can readily be 
seen if we compare the distance between a PR of 40 and a PR of 50 with that be- 
tween a PR of 10 and a PR of 20. Even more striking is the discrepancy between 
these distances and that between a PR of 10 and PR of 1. (In a mathematically 


Figure 3-4. Percentile Ranks in a Normal Distribution. 
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derived normal curve, zero percentile is not reached until infinity and hence can- 
not be shown on the graph.) 

The same relationship can be seen from the opposite direction if we examine 
the percentile ranks corresponding to equal o-distances from the mean of a nor- 
mal curve. These percentile ranks are given under the graph in Figure 3-4. Thus, 
the percentile difference between the mean and Alois 34 (84 — 50). That be- 
tween *16 and 426 is only 14 (98 — 84). 

It is apparent that percentiles show each individual's relative position in the 
normative sample but not the amount of difference between scores, If plotted on 
arithmetic probability paper, however, percentile scores can also provide a correct 
visual picture of the differences between scores. Arithmetic probability paper is a 
cross-section paper in which the vertical lines are spaced in the same way as the 
percentile points in a normal distribution (as in Fig. 3-4), whereas the horizontal 
lines are uniformly spaced, or vice versa (as in Fig. 3-5). Such normal percentile 
charts can be used to plot the scores of different persons on the same test or the 
scores of the same person on different tests. In either case, the actual interscore 
difference will be correctly represented. Many aptitude and achievement batteries 


Figure .3- 5. A Normal Percentile Chart: Percentiles are spaced so as to correspond 
to equal distances in a normal distribution. Compare the score distance between John 


a : percentile difference 
is 5 points. Jane and Dick differ by 10 percentile points, as do Bill and Debby. 
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now utilize this technique in their score profiles, which show the individual's per- 
formance in each test. 


Standard Scores. Current tests are making increasing use of standard scores, 
which are the most satisfactory type of derived score from most points of view. 
Standard scores express the individual's distance from the mean in terms of the 
standard deviation of the distribution. 

Standard scores may be obtained by either linear or nonlinear transformations 
of the original raw scores. When found by a linear transformation, they retain the 
exact numerical relations of the original raw scores, because they are computed by 
subtracting a constant from each raw score and then dividing the result by an- 
other constant. The relative magnitude of differences between standard scores de- 
rived by such a linear transformation corresponds exactly to that between the raw 
scores. All properties of the original distribution of raw scores are duplicated in 
the distribution of these standard scores. For this reason, any computations that 
can be carried out with the original raw scores can also be carried out with linear 
standard scores, without any distortion of results. 

Linearly derived standard scores are often designated simply as "standard 
scores" or "z scores." To compute a z score, we find the difference between the in- 
dividual's raw score and the mean of the normative group and then divide this dif- 
ference by the SD of the normative group. Table 3-3 shows the computation of z 
scores for two individuals; one falls 1 SD above the group mean, the other .40 SD 
below the mean. Any raw score that is exactly equal to the mean is equivalent to 
a z score of zero. It is apparent that such a procedure will yield derived scores that 
have a negative sign for all persons falling below the mean. Moreover, because the 
total range of most groups extends no further than about 3 SDs above and below 
the mean, such standard scores will have to be reported to at least one decimal 
place in order to provide sufficient differentiation among individuals. 

Both the above conditions—that is, the occurrence of negative values and of 
decimals—tend to produce awkward numbers that are confusing and difficult to 
use for both computational and reporting purposes. For this reason, some fur- 
ther linear transformation is usually applied, simply to put the scores into a 


Computation of Standard Scores 
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more convenient form. For example, the scores on the Scholastic Assessment 
Tests (SAT) of the College Board are standard scores adjusted to a mean of 500 
and an SD of 100. Thus a standard score of —1 on this test would be expressed as 
400 (500 — 100 = 400). Similarly, a standard score of +1.5 would correspond to 
650 (500 + 1.5 x 100 = 650). To convert an original standard score to the new 
scale, it is simply necessary to multiply the standard score by the desired SD 
(100) and add it to or subtract it from the desired mean (500). Any other con- 
venient values can be arbitrarily chosen for the new mean and SD. Scores on 
the separate subtests of the Wechsler Intelligence Scales, for instance, are con- 
verted to a distribution with a mean of 10 and an SD of 3. All such measures are 
examples of linearly transformed standard scores. 

It will be recalled that one of the reasons for transforming raw scores into any 
derived scale is to render scores on different tests comparable. The linearly de- 
rived standard scores discussed in the preceding section will be comparable only 
when found from distributions that have approximately the same form. Under 
such conditions, a score corresponding to 1 SD above the mean, for example, sig- 
nifies that the individual occupies the same position in relation to both groups. 
The score exceeds approximately the same percentage of persons in both distribu- 
tions, and this percentage can be determined if the form of the distribution is 
known. If, however, one distribution is markedly skewed and the other normal, 
a z score of +1.00 might exceed only 50% of the cases in one group but would ex- 
ceed 8496 in the other. 

In order to achieve comparability of scores from dissimilarly shaped distribu- 
tions, nonlinear transformations may be employed to fit the scores to any speci- 
fied type of distribution curve. The mental age and percentile scores described in 
earlier sections represent nonlinear transformations, but they are subject to other 
limitations already discussed. Although under certain circumstances another type 
of distribution may be more appropriate, the normal curve is usually employed for 
this purpose. One of the chief reasons for this choice is that most raw score distri- 
butions approximate the normal curve more closely than they do any other type 
of curve. Moreover, physical measures such as height and weight, which use 
equal-unit scales derived through physical operations, generally yield normal dis- 
tributions. Another important advantage of the normal curve is that it has many 
useful mathematical properties, which facilitate further computations. 

Normalized standard scores are standard scores expressed in terms of a distribu- 
tion that has been transformed to fit a normal curve. Such scores can be com- 
puted by reference to tables giving the percentage of cases falling at different SD 
distances from the mean of a normal curve. First, the percentage of persons in the 
standardization sample falling at or above each raw score is found. This percent- 
age is then located in the normal curve frequency table, and the corresponding 
normalized standard score is obtained. Normalized standard scores are expressed 
in the same form as linearly derived standard scores, that is, with a mean of zero 
and an SD of 1. Thus, a normalized score of zero indicates that the individual falls 
at the mean of a normal curve, excelling 50% of the group. A score of 1 means 
that he or she surpasses approximately 16% of the group; and a score of +1, that 
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Normal Curve Percentages for Use in Stanine Conversion 
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he or she surpasses 84%. These percentages correspond to a distance of 1 SD 
below and 1 SD above the mean of a normal curve, respectively, as can be seen by 
reference to the bottom line of Figure 3-4. 

Like linearly derived standard scores, normalized standard scores can be put 
into any convenient form. If the normalized standard score is multiplied by 10 
and added to or subtracted from 50, it is converted into a T score, a type of score 
first proposed by W. A. McCall (1922). On this scale, a score of 50 corresponds 
to the mean, a score of 60 to 1 SD above the mean, and so forth. Another well- 
known transformation is represented by the stanine scale, developed by the 
United States Air Force during World War Il. This scale provides a single-digit 
system of scores with a mean of 5 and an SD of approximately 2.4 The name 
"stanine" (a contraction of “standard nine”) is based on the fact that the scores 
run from 1 to 9. The restriction of scores to single-digit numbers has certain 
computational advantages, since each score requires only a single column on a 
computer. 

Raw scores can readily be converted to stanines by arranging the original 
scores in order of size and then assigning stanines in accordance with the normal 
curve percentages reproduced in Table 3-4. For example, if the group consists of 
exactly 100 persons, the 4 lowest-scoring persons receive a stanine score of 1, 
the next 7 a score of 2, the next 12 a score of 3, and so on. When the group con- 
tains more or fewer than 100 cases, the number corresponding to each designated 
percentage is first computed, and these numbers of cases are then given the appro- 
priate stanines. Thus, out of 200 cases, 8 would be assigned a stanine of 1 (4% of 
200 = 8). With 150 cases, 6 would receive a stanine of 1 (4% of 150 = 6). For any 
group containing from 10 to 100 cases, Bartlett and Edgerton (1966) prepared a 
table whereby ranks can be directly converted to stanines. Because of their practi- 
cal as well as theoretical advantages, stanines have been used widely, especially 
with aptitude and achievement tests. 

Although normalized standard scores are the most satisfactory type of score for 
the majority of purposes, there are nevertheless certain technical objections to 
normalizing all distributions routinely. Such a transformation should be carried 
out only when the sample is large and representative and when there is reason to 
believe that the deviation from normality results from defects in the test rather 


"Kaiser (1958) proposed a modification of the stanine scale that involves slight changes in the percentages and 
yields an SD of exactly 2, thus being easier to handle quantitatively. Other, variants are the C scale (Guilford & 
Fruchter, 1978, pp. 484-487), consisting of 11 units and also yielding an SD of 2, and the 10-unit sten scale, with 


5 units above and 5 below the mean (Canfield, 1951). 
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than from characteristics of the sample and from other factors affecting the be- 
havior under consideration. It should also be noted that when the original distri- 
bution of raw scores approximates normality, the linearly derived standard scores 
and the normalized standard scores will be very similar. Although the methods of 
deriving these two types of scores are quite different, the resulting scores will be 
nearly identical under such conditions, Obviously, the process of normalizing a 
distribution that is already virtually normal will produce little or no change. 
Whenever feasible, it is generally more desirable to obtain a normal distribution 
of raw scores by proper adjustment of the difficulty level of test items rather than 
by subsequently normalizing a markedly nonnormal distribution. With an approx- 
imately normal distribution of raw scores, the linearly derived standard scores will 
serve the same purposes as normalized standard scores. 


The Deviation IQ. In an effort to convert MA scores into a uniform index of 
the individual's relative status, the ratio IQ (Intelligence Quotient) was 
introduced in early intelligence tests. Such an IQ was simply the ratio of mental 
age to chronological age, multiplied by 100 to eliminate decimals (IQ = 100 x 
MA/CA). Obviously if a child's MA equaled her or his CA, that child's IQ was 
exactly 100. An IQ of 100 thus represented normal or average performance. IQs 
below 100 indicated “retardation”; those above 100, “acceleration.” 

The apparent logical simplicity of the traditional ratio IQ, however, soon 
proved deceptive. A major technical difficulty is that, unless the SD of the IQ dis- 
tribution remaihs approximately constant with age, IQs will not be comparable at 
different age levels. An IQ of 115 at age 10, for example, may indicate the same 
degree of superiority as an IQ of 125 at age 12, since both may fall at a distance of 
1 SD from the means of their respective age distributions. In actual practice, it 
proved very difficult to construct tests that met the psychometric requirements for 


ratio IQ has been largely replaced by the so-called deviation 1Q, which is actually 
another variant of the familiar standard score. The deviation IQ is a standard 
score with a mean of 100 and an SD that approximates the SD of the Stanford- 


It should be added that the use of the term “IQ” to designate such standard 
scores may be somewhat misleading. Such IQs are not derived by the same meth- 
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ods employed in finding traditional ratio IQs. They are not ratios of mental ages 
and chronological ages. The justification lies in the general familiarity of the term 
“IQ,” and in the fact that such scores can be interpreted as IQs provided that their 
SD is approximately equal to that of previously known IQs. Among the first tests 
to express scores in terms of deviation IQs were the Wechsler Intelligence Scales. 
In these tests, the mean is 100 and the SD 15. Deviation IQs are also used in a 
number of current group tests of intelligence and in the third (1960) revision of 
the Stanford-Binet itself. 

With the increasing use of deviation IQs, it is important to remember that de- 
viation IQs from different tests are comparable only when they employ the same 
or closely similar values for the SD. This value should always be reported in the 
manual and carefully noted by the test user. If a test maker chooses a different 
value for the SD in making up the deviation IQ scale, the meaning of any given 
IQ on that test will be quite different from its meaning on other tests, These dis- 
crepancies are illustrated in Table 3-5, which shows the percentage’ of cases in 
normal distributions with SDs from 12 to 18 who would obtain IQs at different 
levels. These SD values have actually been employed in the IQ scales of published 
tests. Table 3-5 shows, for example, that an IQ of 70 cuts off the lowest 3.1% 
when the SD is 16 (as in the Stanford-Binet); but it may cut off as few as 0.7% 
(SD = 12) or as many as 5.1% (SD = 18). An IQ of 70 has been used traditionally 
as a cutoff point for identifying clinically significant mental retardation. The 
same discrepancies, of course, apply to IQs of 130 and above, which might be used 
in selecting children for special programs for the intellectually gifted. The IQ 
range between 90 and 110, generally described as normal, may include as few as 
42% or as many as 59.6% of the population, depending on the test chosen. To be 


Percentage of Cases of Each IQ Interval in Normal 
Distributions with Mean of 100 and Different Standard 
Deviations 
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sure, test publishers are making efforts to adopt the uniform SD of 16 in new tests 
and in new editions of earlier tests. There are still enough variations among cur- 
rently available tests, however, to make the checking of the SD imperative. 


Interrelationships of Within-Group Scores. At this stage in our discus- 
sion of derived scores, the reader may have become aware of a rapprochement 
among the various types of scores. Percentiles have gradually been taking on at 
least a graphic resemblance to normalized standard scores. Linear standard scores 
are indistinguishable from normalized standard scores if the original distribution 
of raw scores closely approximates the normal curve. Finally, standard scores have 
become IQs and vice versa. In connection with the last point, a reexamination of 
the meaning of a ratio IQ on such a test as the Stanford-Binet will show that 
these early ratio IQs can themselves be interpreted as standard scores. If we know 
that the distribution of Stanford-Binet ratio IQs had a mean of 100 and an SD of 
approximately 16, we can conclude that an IQ of 116 falls at a distance of I SD 
above the mean and represents a standard score of +1.00. Similarly, an IQ of 132 
corresponds to a standard score of 42.00, an IQ of 76 to a standard score of —1.50, 
and so forth. Moreover, a Stanford-Binet ratio IQ of 116 corresponds to a 
percentile rank of approximately 84, because in a normal curve 8496 of the cases 
fall below +1.00 SD (Fig. 3-4). 

Figure 3-6 summarizes the relationships that exist in a normal distribution 
among the types of scores so far discussed in this chapter. These include z scores, 
College Entrance Examination Board (CEEB) scores, Wechsler deviation IQs 
(SD = 15), T scores, stanines, and percentiles. Ratio IQs on any test will coincide 
with the given deviation IQ scale if they are normally distributed and have an SD 
of 15. Any other normally distributed IQ could be added to the chart, provided 
we know its SD. If the SD is 20, for instance, then an IQ of 120 corresponds to +1 
SD, an IQ of 80 to -1 SD, and so on. 

In conclusion, the exact form in which scores are reported is dictated largely by 
convenience, familiarity, and ease of developing norms. Standard scores in any 
form (including the deviation IQ) have generally replaced other types of scores 
because of certain advantages they offer with regard to test construction and sta- 
tistical treatment of data. Most types of within-group derived scores, however, are 
fundamentally similar if carefully derived and properly interpreted. When certain 
statistical conditions are met, each of these scores can be readily translated into 
any of the others. 


RELATIVITY OF NORMS 


Intertest Comparisons. An IQ, or any other score, should always be accom- 
panied by the name of the test on which it was obtained. Test scores cannot be 
properly interpreted in the abstract; they must be referred to particular tests. If the 
school records show that Bill Jones received an IQ of 94 and Terry Brown an IQ 
of 110, such IQs cannot be accepted at face value without further information. 
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Figure 3-6. Relationships among Different Types of Test Scores in a Normal 
Distribution. 


The positions of these two students might have been reversed by exchanging the 
particular tests that each was given in her or his respective school. 

Similarly, an individual's relative standing in different functions may be grossly 
misrepresented through lack of comparability of test norms. Let us suppose that a 
student has been given a verbal comprehension test and a spatial aptitude test to 
determine her relative standing in the two fields. If the verbal ability test was 
standardized on a random sample of high school students, while the spatial test 
was standardized on a selected group of students attending elective shop courses, 
the examiner might erroneously conclude that the individual is much more able 
along verbal than along spatial lines, when the reverse may actually be the case. 
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Still another example involves longitudinal comparisons of a single individ- 
ual's test performance over time. If a schoolchild's cumulative record shows IQs of 
118, 115, and 101 at the fourth, fifth, and sixth grades, the first question to ask be- 
fore interpreting these changes is, "What tests did the child take on these three 
occasions?" The apparent decline may reflect no more than the differences among 
the tests. In that case, the child would have obtained these scores even if the 
three tests had been administered within a week of one another. 

There are three principal reasons to account for systematic variations among 
the scores obtained by the same individual on different tests. First, tests may differ 
in content despite their similar labels. So-called intelligence tests provide many il- 
lustrations of this confusion. Although commonly described by the same blanket 
term, one of these tests may include only verbal content, another may tap pre- 
dominantly spatial aptitudes, and still another may cover verbal, numerical, and 
spatial content in about equal proportions. Second, the scale units may not be 
comparable. As explained earlier in this chapter, if IQs on one test have an SD of 
12 and IQs on another have an SD of 18, then an individual who received an IQ 
of 112 on the first test is most likely to receive an IQ of 118 on the second. Third, 
the composition of the standardization samples used in establishing norms for dif- 
ferent tests may vary. Obviously, the same individual will appear to have per- 
formed better when compared with a less able group than when compared with a 
more able group. 

Lack of comparability of either test content or scale units can usually be de- 
tected by reference to the test itself or to the test manual. Differences in the re- 
spective normative samples, however, are more likely to be overlooked. Such 
differences probably account for many otherwise unexplained discrepancies in 
test results. 


The Normative Sample. Any norm, however, expressed, is restricted to the 
particular normative population from which it was derived. The test user should 
never lose sight of the way in which norms are established. Psychological test 
norms are in no sense absolute, universal, or permanent. They merely represent 
the test performance of the persons constituting the standardization sample. In 
choosing such a sample, an effort is usually made to obtain a representative cross- 
section of the population for which the test is designed. 

In statistical terminology, a distinction is made between sample and population. 
The sample refers to the group of persons actually tested. The population desig- 
nates the larger, but similarly constituted, group from which the sample is drawn. 
For example, if we wish to establish norms of test performance for the population 
cf 10-year-old, urban, public school boys, we might test a carefully chosen sample 
of 500 10-year-old boys attending public schools in several American cities. The 
sample would be checked with reference to geographical distribution, socioeco- 
nomic level, ethnic composition, and other relevant characteristics to ensure that 
it was truly representative of the defined population. 

In the development and application of test norms, considerable attention 
should be given to the standardization sample. It is apparent that the sample on 
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which the norms are based should be large enough to provide stable values. An- 
other, similarly chosen sample of the same population should not yield norms that 
diverge appreciably from those obtained. Norms with a large sampling error 
would obviously be of little value in the interpretation of test scores. 

Equally important is the requirement that the sample be representative of the 
population under consideration. Subtle selective factors that might make the 
sample unrepresentative should be carefully investigated. A number of such selec- 
tive factors are illustrated in institutional samples. Because such samples are usu- 
ally large and readily available for testing purposes, they offer an alluring field for 
the accumulation of normative data. The special limitations of these samples, 
however, should be carefully analyzed. Testing persons in school, for example, will 
yield an increasingly superior selection of cases in the successive grades, owing to 
the progressive dropping out of the less able students: Nor does such elimination 
affect different subgroups equally. For example, the rate of selective elimination 
from school is greater for boys than for girls, and it is greater at lower than at 
higher socioeconomic levels. 

Selective factors likewise operate in other institutional samples, such as prison- 
ers, patients in mental hospitals, or institutionalized mentally retarded persons. 
Because of the many special factors that determine institutionalization itself, such 
groups are not representative of the entire population of law breakers, mentally 
disturbed persons, or mentally retarded persons. For example, mentally retarded 
individuals with physical handicaps are more likely to be institutionalized than 
are the physically fit. Similarly, the relative proportion of severely retarded per- 
sons will be much greater in institutional samples than in the total population. 

Closely related to the question of sample representativeness is the need for 
defining the specific population to which the norms can be generalized. Obvi- 
ously, one way of ensuring that a sample is representative is to restrict the popula- 
tion to fit the specifications of the available sample. For example, if the 
population is defined to include only 14-year-old schoolchildren rather than all 
14-year-old children, then a school sample would be representative. Ideally, of 
course, the desired population should be defined in advance in terms of the objec- 
tives of the test. Then a suitable sample should be assembled. Practical obstacles 
in obtaining participants, however, may make this goal unattainable: In such a 
case, it is far better to redefine the population more narrowly than to report norms 
on an ideal population that is not adequately represented by the standardization 
sample. In actual practice; very few tests are standardized on such broad popula- 
tions as is popularly assumed. No test provides norms for the human species! And 
it is doubtful whether many tests give truly adequate norms for such broadly de- 
fined populations as “adult American. men,” “10-year-old American children,” 
and the like. Consequently, the samples obtained by different test constructors 
may be unrepresentative of their alleged populations and biased in different ways. 
Hence, the resulting norms may not be comparable. 

When interpreting test scores, the test user should take into account the spe- 
cific influences that may have acted upon the normative sample employed in 
standardizing the particular test. Such influences would include special selective 
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factors, as well as societal conditions prevailing at the time the normative data 
were collected (Anastasi, 1985d). 


National Anchor Norms. One solution for the lack of comparability of 
norms is to use an anchor test to work out equivalency tables for scores on differ- 
ent tests. Such tables are designed to show what score in Test A is equivalent to 
each score in Test B. This can be done by the equipercentile method, in which 
scores are considered equivalent when they have equal percentiles in a given 
group. For example, if the 80th percentile in the same group corresponds to an IQ 
of 115 on Test A and to an 1Q of 120 on Test B, then Test-A-IQ 115 is considered 
equivalent to Test B-IQ-120. This approach has been followed to a limited extent 
by some test publishers, who have prepared equivalency tables for a few of their 
own tests (e.g., Lennon, 1966a). 

More ambitious proposals have been made from time to time for calibrating 
each new test against a single anchor test, which has itself been administered to a 
highly representative, national normative sample (Lennon, 1966b). An example 
of this procedure is provided by the Anchor Test Study conducted by Educational 
Testing Service under the auspices of the U. S. Office of Education (Jaeger, 1973). 
This study represents a systematic effort to provide comparable and truly repre- 
sentative national norms for seven of the most widely used reading achievement 


other groups, all the pairings were duplicated in reverse sequence, in order to con- 
trol for order of administration. From statistical analyses of all these data, score 


Scale (Rentz & Bashaw, 1977). The conversion table thus developed permits the 
transformation of a score from any form of the seven tests at any of the grade 
levels into a three-place score on a uniform, continuous scale. This scale was con- 
structed by using the item analysis and scaling methods of the Rasch model, one 
of the simplest of the item analysis models discussed in a later section of this 
chapter and described more fully in chapter 7. 

For many testing Purposes, it is useful to have comparable scores from different 
tests that are expressed on a uniform scale of Measurement and referred to a single 
normative sample. It should be noted, however, that there ate different degrees and 
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kinds of score comparability. The comparability obtained in particular situations de- 
pends upon the similarity of the tests in content and in such psychometric proper- 
ties as reliability and difficulty level, as well as on the statistical procedures used to 
achieve comparability (Angoff, 1984; Angoff & Cowell, 1986; P. W. Holland & 
Rubin, 1982). Tests should not be described as equated or fully equivalent unless 
they are truly interchangeable. Nevertheless, different kinds and degrees of compa- 
rability can facilitate the interpretation of test results, provided that the comparable 
scores are used appropriately and with full knowledge of how they were derived. 


Specific Norms. Another approach to the nonequivalence of existing norms 
—and probably a more realistic one for most tests—is to standardize tests on more 
narrowly defined populations, so chosen as to suit the specific purposes of each 
test. In such cases, the limits of the normative population should be clearly 
reported with the norms. Thus, the norms might be said to apply to “employed 
clerical workers in large business organizations” or to “first-year engineering 
students.” For many testing purposes, highly specific norms are desirable. Even 
when representative norms are available for a broadly defined population, it is 
often helpful to have separately reported subgroup norms. This is true whenever 
recognizable subgroups yield appreciably different scores on a particular test. The 
subgroups may be formed with respect to age, grade, type of curriculum, sex, 
geographical region, urban or rural environment, socioeconomic level, and many 
other variables. The use to be made of the test determines the type of dif- 
ferentiation that is most relevant, as well as whether general or specific norms 
are more appropriate. 

Mention should also be made of local norms, often developed by the test users 
themselves within a particular setting. The groups employed in deriving such 
norms are even more narrowly defined than the subgroups considered previously. 
Thus, an employer may accumulate norms on applicants for a given type of job 
within a particular company. A college admissions office may develop norms on 
its own student population. Or a single elementary school may evaluate the per- 
formance of individual pupils in terms of its own score distribution. These local 
norms are more appropriate than broad national norms for many testing purposes, 
such as the prediction of subsequent job performance or college achievement, the 
comparison of a child’s relative achievement in different subjects, or the measure- 


ment of an individual’s progress over time. 


Fixed Reference Group. Although the way most derived scores are com- 
puted provides an immediate normative interpretation of test performance, there 
are some notable exceptions. One type of nonnormative scale utilizes a fixed 
reference group in order to ensure comparability and continuity of scores, without 
providing normative evaluation of performance. With such a scale, normative 
interpretation requires reference to independently collected norms from a suitable 
population. Local or other specific norms are often used for this purpose. 


d | 
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One of the earliest examples of scaling in terms of a fixed reference group is 
provided by the score scale of the College Board Scholastic Aptitude Test? (Don- 
lon, 1984). Between 1926 (when this test was first administered) and 1941, SAT 
scores were expressed on a normative scale, in terms of the mean and SD of the 
candidates taking the test at each administration. As the number and variety of 
College Board member colleges increased and the composition of the candidate 
population changed, it was concluded that scale continuity should be maintained. 
Otherwise, an individual's score would depend on the characteristics of the group 
tested during a particular year. An even more urgent reason for scale continuity 
stemmed from the observation that students taking the SAT at certain times of 
the year performed less well than those taking it at other times, owing to the dif- 
ferential operation of selective factors. After 1941, therefore, all SAT scores were 
expressed in terms of the mean and SD of the approximately 11,000 candidates 
who took the test in 1941, These candidates constituted the fixed reference group 
employed in scaling subsequent forms of the test. Thus, a score of 500 on any form 
of the SAT corresponded to the mean of the 1941 sample; a score of 600 fell 1 SD 
above that mean, and so forth. 

To permit translation of raw scores on any form of the SAT into these fixed- 
reference-group scores, a short anchor test (or set of common items) was in- 
cluded in each form. Each new form was thereby linked to one or two earlier 
forms, which in turn were linked with other forms by a chain of items extending 
back to the 1941 form. These nonnormative SAT scores could then be inter- 
preted by comparison with any appropriate distribution of scores, such as that of 
a particular college, a type of college, a region, and so on. These specific norms 
are more useful in making college admission decisions than would be annual 
norms based on the entire candidate population. Any changes in the candidate 
population over time, moreover, can be detected only with a fixed-score scale. 
More recently, the SAT scale has been "recentered" on the performance of over. 
a million students who graduated from high school in 1990 and took the test as 
juniors or seniors. Scores for students completing the SAT after April 1, 1995, 
are reported on the "recentered" scale derived from the 1990 reference group. 
Interpretive materials and aids have been developed to assist test users in con- 
verting individual and aggregate scores from the former scale and vice versa (see 
chap. 17). A full and diversified interpretation of individual performance is 
thus permitted for specific testing purposes.° 

Scales built from a fixed reference group are analogous in one respect to scales 


employed in physical measurement. In this connection, Angoff (1962, pp. 32-33) 
commented: 


?This test was later renamed the Scholastic Assessment Test, in order to reflect the altered orientation toward the 
nature of test scores, which emerged toward the end of the twentieth century. (See especially chap. 12 for the ef- 
fect of individual differences in experiential history on test performance.) 


6We acknowledge the assistance of Wayne Camara of the College Board in obtaining the information reported 
herein. 
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There is hardly a person here who knows the precise original definition: of the 
length of the foot used in the measurement of height or distance, or which king it 
was whose foot was originally agreed upon as the standard; on the other hand, there 
is no one here who does not know how to evaluate lengths and distances in terms of 
this unit. Our ignorance of the precise original meaning or derivation of the foot 
does not lessen its usefulness to us in any way. Its usefulness derives from the fact 
that it remains the same over time and allows us to familiarize ourselves with it. 
Needless to say, precisely the same considerations apply to other units of measure- 
ment—the inch, the mile, the degree of Fahrenheit, and so on. In the field of psy- 
chological measurement it is similarly reasonable to say that the original definition 
of the scale is or should be of no consequence. What is of consequence is the main- 
tenance of a constant scale—which, in the case of a multiple-form testing program, 
is achieved by rigorous form-to-form equating—and the provision of supplementary 
normative data to aid in interpretation and in the formation of specific decisions, 
data which would be revised from time to time as conditions warrant. 


Item Response Theory. Since the 1970s, there has been a sharp upsurge 
of interest in a class of mathematically sophisticated procedures for scaling 
the difficulty of test items (Hambleton, 1989; Hambleton, Swaminathan, & 
Rogers, 1991; Jaeger, 1977). Because of the extensive computations required, 
these procedures became practicable only with the increasing availability of high- 
speed computers. Although differing in complexity and in specific mathematical 
procedures, these approaches were originally grouped under the general title of 
latent trait models. The basic measure they used is the probability that a person of 
specified ability (the so-called latent trait) succeeds on an item of specified 
difficulty. There is no implication, however, that such latent traits or underlying 
abilities exist in any physical or physiological sense, nor that they cause behavior. 
The latent traits are statistical constructs that are derived mathematically from 
empirically observed relations among test responses. A rough, initial estimate of 
an examinee’s latent trait is the total score he or she obtains on the test. In order 
to avoid the false impression created by the term “latent trait,” some of the lead- 
ing exponents of these procedures have substituted the more precisely descriptive 
term “item response theory,” or IRT (Lord, 1980; D. J. Weiss & Davison, 1981). 
This designation has achieved-general usage within psychology. 

Essentially, IRT models are used to establish a uniform “sample-free” scale of 
measurement that is applicable to individuals and groups of widely varying ability 
levels and to test content of widely varying difficulty levels. Like the use of a fixed 
reference group described in the preceding section, IRT models need anchor 
items or a common test as a bridge across examinee samples and across tests or 
sets of items, However, rather than using the mean and SD of a specific reference 
group to define the origin and the unit size of the scale, IRT models set origin and 
unit size in terms of data representing a wide range of ability and item difficulty, 
which may come from several samples. Usually the origin is set near the center of 
this range. The common scale unit is derived mathematically from the item data; 
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it has several advantages, both theoretical and practical, over the earlier item 
analysis procedures. The specific methodology will be discussed further in chapter 
7, in connection with techniques of item analysis. IRT is gradually being incorpo- 
rated in large-scale testing programs. For example, beginning in 1982, this proce- 
dure was adopted for equating total scores on the new forms of the SAT, so as to 
express them on the continuing, uniform scale (Camara, Freeman, & Everson, 
1996; Donlon, 1984). 

The general problem of test equating, whereby scores from different forms of a 
test are expressed on a uniform score scale, has received increasing attention. The 
technical problems of the diverse approaches that are being explored for this pur- 
pose are beyond the scope of this text. For a comprehensive survey and critical 
evaluation of such methodology, the reader is referred to P. W. Holland and Rubin 
(1982) and to Petersen, Kolen, and Hoover (1989). 


COMPUTERS AND THE INTERPRETATION OF TEST SCORES 


Technical Developments. Computers have had a major impact on every phase of 
testing, from test construction to administration, scoring, reporting, and 
interpretation (E B. Baker, 1989; Butcher, 1987; Gutkin & Wise, 1991; Roid, 1986). 
The obvious uses of computers—and those developed earliest—represented simply 
an unprecedented increase in the speed with which data analyses and scoring 
processes could be carried out. The use of computers in the automated admin- 
istration of conventional tests may also be considered in this category, insofar as they 
provide easier and better ways of administering such tests. Far more significant, 
however, is the contribution of computers to the exploration of new procedures and: 
approaches to psychological testing that would have been impossible without the 
flexibility and data-processing capabilities they provide. This effect of computers is 
illustrated by the increasing adoption of IRT models for sample-free scaling, cited in 
the preceding section. Other testing innovations resulting from computer utilization 
will be discussed under appropriate topics throughout the book. 

In the present connection, we shall examine some applications of computers to 
the evaluation of test performance (F B. Baker, 1989; Gutkin & Wise, 1991; Roid 
& Gorsuch, 1984). At the simplest level, most current tests, especially those de- 
signed for group administration, are now adapted for computer scoring. Several test 
publishers, as well as independent test-scoring organizations, are equipped to pro- 
vide such scoring services to test users. Moreover, there is also increasing avail- 
ability of computer disks that can be employed by test users to score tests on their 
own computers (e.g. the ASSIST programs developed by American Guidance 
Service). At a more complex level, narrative computer interpretation of test results 
is available for certain tests. In such cases, the computer program associates pre- 
pared verbal statements with particular patterns of test responses. This approach 
has been pursued with both personality and aptitude tests. For example, with the 
Minnesota Multiphasic Personality Inventory (MMPI), discussed in chapter 13, 
test users may obtain computer printouts of diagnostic and interpretive state- 
ments about the test taker's personality tendencies and emotional condition, to- 
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gether with the numerical scores. For test users who have access to their own 
computers, there are increasing opportunities to purchase computer programs 
that yield not only numerical scores but also interpretive reports for particular 
tests, such as the revised Wechsler intelligence scales for children (WISC-R) and 
for adults (WAIS-R). 

Individualized interpretation of test scores at a still more complex level is illus- 
trated by interactive computer systems, in which the individual is in direct contact 
with the computer by means of response stations and in effect engages in a dialogue 
with the computer (J. A. Harris, 1973; Holtzman, 1970; M. R. Katz, 1974; Super et 
al., 1970). This technique has been investigated with regard to educational and ca- 
reer planning and decision making. In such a situation, test scores are usually incor- 
porated in the computer data base, together with other information provided by 
the student or client. Essentially, the computer combines all the available informa- 
tion about the individual with stored data about educational programs and occupa- 
tions; and it utilizes all relevant facts and relations in answering the individual's 
questions and aiding her or him in reaching decisions. An example of such interac- 
tive computer systems is the System for Interactive Guidance Information (“SIGI,” 
1974—1975). In use for over a decade in colleges and universities, this system was 
subsequently updated and revised to serve not only students but also adults prepar- 
ing to enter or reenter the job market or considering career changes or advance- 
ment (M. R. Katz, 1993; Norris, Schott, Shatkin, & Bennett, 1986). 


Hazards and Guidelines. Although computers have undoubtedly opened the 
way for unprecedented improvements in all aspects of psychological testing, 
certain applications of computers may lead to misuses and misinterpretations of 
test scores (Butcher, 1985a; J. J. Kramer & Mitchell, 1985; Matarazzo, 1983, 
19862, 1986b). In the effort to guard against these hazards, considerable attention 
has been given to developing guidelines for computer-based testing. The Testing 
Standards (AERA, APA, NCME, 1985) include several standards pertaining to 
computer-based testing. A more comprehensive and detailed set of guidelines has 
been developed with special reference to computer uses in the various aspects of 
testing (see, e.g., Butcher, 1987, pp. 413—431). For a thorough evaluation of the 
use of computers in testing, with particular reference to computerized score 
interpretation, see Moreland (1985, 1992). 

Two of the major concerns regarding computerized testing pertain to score 
comparability and to narrative interpretive scoring. When the same test is admin- 
istered in computerized mode and in traditional printed mode, the comparability 
of scores needs to be investigated (Mazzeo, Druesne, Raffeld, Checketts, & 
Muhlstein, 1991). Unless the two modes are shown to yield fully equated test 
forms, the same set of norms may not be applicable to both; the reliability and va- 
lidity of the test may also vary. It is especially important to check score compara- 
bility for different individuals or groups whose experience with computer use, and 
particularly with computerized testing, may differ substantially. 

The rapid growth of computer services that provide narrative interpretive 
score reports has aroused particular concern. Two basic principles underlie many 
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of the relevant guidelines. First, adequate information must be provided to enable 
the test user to evaluate the reliability, validity, and other technical properties 
of the interpretive system employed in preparing the software program. How were 
the interpretive statements derived from the scores? What is the theoretical ration- 
ale and the research base of the system? Were the statements derived from quanti- 
tative analyses or from expert clinical judgment? If the latter, some information 
regarding the qualifications of the participating experts should be given. 

A second underlying principle is that, when interpretive score reports are used 
for clinical or counseling purposes, or if in other ways they enter into significant 
decisions about individuals, it is essential to take into account other available 
sources of data about test takers. For this reason, the score reports should be used 
only by fully qualified professionals. The score reports should be regarded as an aid 
to, rather than as a substitute for, the professional specialist. 


DOMAIN-REFERENCED TEST INTERPRETATION 


Nature and Uses. ` An approach to testing that aroused a surge of activity in the 
1970s, particularly in education, was initially designated as “criterion-referenced 
testing.” First proposed by Glaser (1963), this term is still used somewhat loosely, 
and its definition varies among different writers. Moreover, several alternative terms 
are in common use, such as content-, domain-, and objective-referenced. These 
terms are sometimes employed as synonyms for criterion-referenced and sometimes 
with slightly different connotations. Gradually, the more precisely descriptive terms 
have been replacing the earlier “criterion-referenced” designation. In this book, the 
term “domain-referenced” is henceforth used for this purpose. 

Typically domain-referenced testing uses as its interpretive frame of reference a 
specified content domain rather than a specified population of persons. In this re- 
spect, it has been contrasted with the usual norm-referenced testing, in which an 
individual’s score is interpreted by comparing it with the scores obtained by oth- 
ers on the same test. In domain-referenced testing, for example, a test taker's per- 
formance may be reported in terms of the specific kinds of arithmetic operations 
she has mastered, the estimated size of her vocabulary, the difficulty level of read- 
ing matter she can comprehend (from comic books to literary classics), or the 
chances of her achieving a designated performance level on an external criterion 
(educational or occupational). 

Thus far, domain-referenced testing has found its major applications in sev- 
eral innovations in education. Prominent among these are computer-assisted, 
computer-managed, and other individualized, self-paced instructional systems. In 
all these systems, testing is closely integrated with instruction, being introduced 
before, during, and after completion of each instructional unit to check on pre- 
requisite skills, diagnose possible learning difficulties, and prescribe subsequent 
instructional procedures (Nitko, 1989). 

From another angle, domain-referenced tests have been used in broad surveys of 
educational accomplishment, such as the National Assessment of Educational 
Progress (E. G. Johnson, 1992; Messick, Beaton, & Lord, 1983; E B. Womer, 1970), 
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and in meeting demands for educational accountability. From still another angle, 
testing for the attainment of minimum requirements, as in qualifying for a driver's 
license or a pilot's license, illustrates domain-referenced testing. A related appli- 
cation is in testing for job proficiency where the mastery of a small number of 
clearly defined job skills is to be assessed, as in military occupational specialties 
(Maier & Hirshfeld, 1978; Swezey & Pearlstein, 1975). 

Finally, familiarity with the concepts of domain-referenced testing can con- 
tribute to the improvement of the traditional, informal tests prepared by teachers 
for classroom use: Linn and Gronlund (1995) provide a detailed guide for this 
purpose, as well as-a simple and well-balanced treatment of domain-referenced 
testing. A brief: but excellent discussion of the chief limitations of domain- 
referenced tests is given by Ebel (1972). A comprehensive treatment of many of 
the technical problems in the construction and evaluation of domain-referenced 
tests can be found in Berk (1984a). 


Content Meaning. The major distinguishing feature of domain-referenced 
testing (however defined and whether designated by this term or by one of its 
synonyms) is its interpretation of test performance in terms of content meaning. 
The focus is clearly on what test takers can do and what they know, not on how 
they compare with others. A fundamental requirement in constructing this type 
of test is a clearly defined domain of knowledge or skills to be assessed by the test. 
If scores on such a test are to have communicable meaning, the content domain 
to be sampled must be widely recognized as important. The selected domain must 
then be subdivided into small units defined in performance terms. In an 
educational context, these units correspond to behaviorally defined instructional 
objectives, such as “multiplies three-digit by two-digit numbers" or "identifies the 
misspelled word in which the final e is retained when adding -ing.” In the pro- 
grams prepared for individualized instruction, these objectives may run to several 
hundred for a single school subject. After the instructional objectives have been 
formulated, items are prepared to sample each objective. This procedure is admit- 
tedly difficult and time consuming. Without careful specification and control of 
content, however, the results of domain-referenced testing could degenerate into 
an idiosyncratic and uninterpretable jumble. A practical compromise is to identi- 
fy and define key concepts, principles, methodologies, or instructional objectives 
through expert judgment; each of the significant domains thus defined can then 
be sampled thoroughly with appropriate test items. Undoubtedly, the degree of 
specificity with which behavior domains need to be assessed varies with the 
nature and purpose of the test (Popham, 1984; Roid, 1984). 

When strictly applied, domain-referenced testing is best adapted for testing 
basic skills (as in reading and arithmetic) at elementary levels. In these areas, in- 
structional objectives can usually be arranged in an ordinal hierarchy, the acquisi- 
tion of more elementary skills being prerequisite to the acquisition of higher-level 
skills.” It is impracticable and probably undesirable, however, to formulate highly 


"Ideally such tests follow the simplex model of a Guttman scale (see Popham & Husek, 1969), as do the Piaget- 
ian ordinal scales discussed in chapter 9. 
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specific objectives for advanced levels of knowledge in less highly structured sub- 
jects. At these levels, both the content and sequence of learning are likely to be 
much more flexible. 

On the other hand, in its emphasis on content meaning in the interpretation 
of test scores, domain-referenced testing may exert a salutary effect on testing in 
general. The interpretation of intelligence test scores, for example, would benefit 
from this approach. To describe a child’s intelligence test performance in terms of 
the specific intellectual skills and knowledge it represents might help to counter- 
act the confusions and misconceptions that have become attached to the tradi- 
tional IQ. When stated in these general terms, however, the domain-referenced 
approach is equivalent to interpreting test scores in the light of the demonstrated 
validity of the particular test, rather than in terms of vague underlying entities. 
Suchan interpretation can certainly be combined with norm-referenced scores. 


Mastery Testing. A second major feature commonly associated with domain- 
referenced testing is the procedure of testing for mastery. Essentially, this 
procedure yields an all-or-none score, indicating that the individual has or has 
not attained the preestablished level of mastery. When basic skills are tested, 
neatly complete mastery is generally expected (e.g. 80 to 8596 correct items). A 
three-way distinction may also be employed, including mastery, nonmastery, and 
an intermediate, doubtful, or “review” interval. 

In connection with individualized instruction, some educators have argued 
that, given enough time and suitable instructional methods, nearly everyone can 
achieve complete mastery of the chosen instructional objectives. Individual dif- 
ferences would thus be manifested in learning time rather than in final achieve- 
ment as in traditional educational testing (Carroll, 1963, 1970; Cooley & Glaser, 
1969; Gagné, 1965). It follows that in mastery testing, individual differences in 
performance are of little or no interest. Hence, as generally constructed, domain- 
referenced tests minimize individual differences in performance after appropriate 
training. Mastery testing is regularly employed in the previously cited programs 
for individualized instruction. It is also characteristic of published domain- 
referenced tests for basic skills, suitable for elementary school. 

In the construction of such tests, two important questions are: (1) How many 
items must be used for reliable assessment of each of the specific instructional ob- 
jectives covered by the test? (2) What Proportion of items must be correct for the 
reliable establishment of mastery? In much of the early domain-referenced test- 
ing, these two questions were answered by judgmental decisions. Substantial 
progress has been made, however, in developing appropriate statistical techniques 
that may provide objective, empirical answers (Berk, 1984a; R. L. Ferguson & 
Novick, 1973; Hambleton, 19842, 1989; Hambleton & Novick, 1973). A few ex- 
amples will serve to illustrate the nature and scope of these efforts. 

The two questions about number of items and cutoff score can be incorporated 
into a single hypothesis, amenable to testing within the framework of decision 
theory and sequential analysis (Hambleton, 1984a; Wald, 1947). Specifically, we 
wish to test the hypothesis that the test taker has achieved the required level of 
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mastery in the content domain or instructional objective sampled by the test 
items. Sequential analysis consists of taking observations one at a time and decid- 
ing after each observation whether to (1) accept the hypothesis, (2) reject the hy- 
pothesis, or (3) make additional observations. Thus, the number of observations 
(in this case, number of items) needed to reach a reliable conclusion is itself de- 
termined during the process of testing. Rather than being presented with a fixed, 
predetermined number of items, the examinee continues taking the test until a 
mastery or nonmastery decision is reached. At that point, testing is discontinued 
and the student is either directed to the next instructional level or returned to the 
nonmastered level for further study. With the computer facilities described earlier 
in this chapter, such sequential decision procedures are practicable and can re- 
duce total testing time while yielding reliable estimates of mastery. 

Some investigators have been exploring the use of Bayesian estimation tech- 
niques which incorporate collateral data and lend themselves well to the kind of 
decisions required by mastery testing. Because of the large number of specific in- 
structional objectives to be tested, domain-referenced tests typically provide only 
a small number of items for each objective. To supplement this limited information, 
procedures have been developed for incorporating collateral data from the student's 
previous performance history, as well as from the test results of other students 
(R. L. Ferguson & Novick, 1973; Hambleton, 1984a; Hambleton & Novick, 1973). 

When individually tailored procedures are impracticable, cutoff scores can be 
established empirically by analyzing preinstruction and postinstruction scores of 
appropriate groups on the given test. À cutting score is then selected that best dis- 
criminates between those who have and those who have not received the rele- 
vant training (Panell & Laabs, 1979; L. A. Shepard, 1984). Judgment is required 
in specific situations to assess the relative seriousness of "passing" a person who is 
not qualified versus "failing" one who is. The cutoff would be accordingly raised 
or lowered to adjust to the seriousness of the consequences of misclassification. 


Relation to Norm-Referenced Testing. Beyond basic skills, mastery testing 
is inapplicable or insufficient. In more advanced and less structured subjects, 
achievement is open-ended. The individual may progress almost without limit in 
such functions as understanding, critical thinking, appreciation, and originality. 
Moreover, content coverage may proceed in many different directions, depending 
upon the individual’s abilities, interests, and goals, as well as local instructional 
facilities. Under these conditions, complete mastery is unrealistic and unneces- 
sary. Hence, norm-referenced evaluation is generally employed in such cases to 
assess degree of attainment. Some published tests are so constructed as to permit 
both norm-referenced and domain-referenced applications. An example is 
provided by the Stanford diagnostic tests in reading and in mathematics. While 
providing appropriate norms at each level, these tests permit a qualitative analysis 
of the child’s attainment of detailed instructional objectives. 

It should be noted that domain-referenced testing is neither as new nor as 
clearly divorced from norm-referenced testing as some of its proponents implied. 
Evaluating an individual’s test performance in absolute terms, such as by letter 
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grades or percentage of correct items, is certainly much older than normative in- 
terpretations. More precise attempts to describe test performance in terms of con- 
tent meaning also antedate the introduction of the term "criterion-referenced 
testing" (Ebel, 1962; J. C. Flanagan, 1962; Nitko, 1984, pp. 14-16). Other exam- 
ples may be found in early product scales for assessing the quality of handwriting, 
compositions, or drawings by matching the individual's work sample against a set 
of standard specimens. Ebel (1972) observed, furthermore, that the concept of 
mastery in education—in the sense of all-or-none learning of specific units— 
achieved considerable popularity in the 1920s and 1930s and was later abandoned. 

A normative framework is implicit in all testing, regardless of how scores are 
expressed (Angoff, 1974; Nitko, 1984). The very choice of content or skills to be 
measured is influenced by the examiner's knowledge of what can be expected 
from human organisms at a particular developmental or instructional stage. Such 
a choice presupposes information about what other persons have done in similar 
situations. Moreover, by imposing uniform cutoff scores on an ability continuum, 
mastery testing does not thereby eliminate individual differences. To describe an 
individual's level of reading comprehension as “the ability to understand the con- 
tent of The New York Times" still leaves room for a wide range of individual differ- 
ences in degree of understanding. Applying a cutoff point to dichotomize 
performance simply ignores the remaining individual differences within the two 
categories and discards potentially useful information. 


MINIMUM QUALIFICATIONS AND CUTOFF SCORES 


Practical Needs and Pitfalls. The concept of mastery in domain-referenced 
testing is only one example of the practical use of cutoff scores in decision 
making. Minimum qualifications must be specified and implemented for a 
multitude of purposes in everyday life. In many situations, safety considerations 
require the setting of cutoff points in performance, as in granting a driver's 
license, selecting airline pilots, or employing workers in a nuclear plant. In 
education, passing a course or graduating from a school presents other situations 
that demand all-or-none classifications of persons (Jaeger, 1989). In clinical and 
counseling practice, decisions regarding treatment or action recommendations 
may call for similar judgments. 

A particularly strong argument for the use of cutoff scores pertains to the pres- 
ence of critical variables required for the performance of certain functions. These 
are variables in which a deficiency could not be compensated by outstanding abil- 
ity in other variables. In such cases, a high score on a comprehensive selection 
battery could mask a deficiency on a critical skill. Yet individuals falling below the 
required minimum on the essential skill would fail, regardless of their other abili- 
ties. For example, operators of sonar equipment must have good auditory discrim- 
ination. During World War II, U.S. Navy recruits were initially selected for 
training as sonar operators on the basis of their combined scores on tests of audi- 
tory discrimination and mechanical comprehension. As a result, a number of 
college-trained men who excelled in mechanical comprehension but happened to 
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be deficient in the essential auditory skills were assigned to such training, with 
subsequent failure. Standard Navy procedure required that those failing in their 
first training assignment be transferred to general sea duty as apprentice seamen— 
thereby losing their potential specialized services. Further analysis of the situation 
led in time to the substitution of a cutoff screening procedure for this selection 
purpose. For most job-related variables, however, the relation to job performance 
tends to be linear, so that the higher the test score the better the performance 
(Coward & Sackett, 1990). In such cases, a person's actual score is a better pre- 
dictor than her or his position relative to a cutoff point. 

Insofar as the use of cutoff scores cannot be avoided in many practical deci- 
sions, it is essential to be aware of the pitfalls in such evaluations and to utilize 
procedures to reduce judgment errors. For example, efforts should be made to mit- 
igate the limitations of a single test score. When feasible, the cutoff should be a 
band of scores, rather than a single score obtained in one administration of a par- 
ticular test. Moreover, decisions about individuals should depend on multiple 
sources of information about each person, wherein test scores are supplemented 
by other relevant performance data, both past and present. If cutoff points on 
tests are set by a panel of judges, the panel should include adequate representation 
of experts on both the relevant area of task performance and the principles of test 
construction and use. Above all, cutoff scores should whenever possible be estab- 
lished or verified on the basis of empirical data. Specifically, this implies that test 
scores be obtained from groups that clearly differ on the relevant criterion behav- 
ior, such as the actual performance of a given type of job. It is this performance, of 
course, that the test is designed to predict and in which the cutoff is designed to 
ensure a safe, or acceptable, or desirable minimum. A clear illustration of an em- 
pirical method for setting cutoff scores on a personnel selection test is provided by 
the expectancy tables discussed in the following section. 


Expectancy Tables. One of the ways of interpreting the meaning of a test 
score is in terms of the person’s expected criterion performance, as in a training 
program or on a job. This usage of the term “criterion” follows standard 
psychometric practice, as when a test is said to be validated against a particular 
criterion (see chap. 1). Strictly speaking, the term "criterion-referenced testing” 
should refer to this type of performance interpretation, while the other 
approaches discussed in the preceding section can be more precisely described as 
content-referenced or domain-referenced. 

An expectancy table gives the probability of different criterion outcomes for 
persons who obtain each test score. For example, if a student obtained a score of 
530 on the College Board Scholastic Assessment Test (SAT), what are the 
chances that her or his freshman grade-point average in a specific college will fall 
in the A, B, C, D, or F category? This type of information can be obtained by 
examining the bivariate distribution of predictor scores (from SAT) plotted 
against criterion status (freshman grade-point average). If the number of cases in 
each cell of such a bivariate distribution is changed to a percentage, the result is an 
expectancy table, such as the one illustrated in Table 3-6. The data for this table 
were obtained from 211 seventh-grade students enrolled in mathematics courses. 
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Table 3-6 
Expectancy Table Showing Relation between DAT Numerical 


| Reasoning Test and Course Grades in Mathematics for 211 
Students in Grade 7 


Percentage Receiving Each Grade 


Number 


Test Score EE 


30 & above 
20-29 9 21 43 27 
10-19 37 37 24 3 


Below 10 


(Adapted from Technical Manual for Differential Aptitude Tests, 5th ed., p. 152. Reproduced by 
permission. Copyright © 1992 by The Psychological Corporation.) 


The predictor was the Numerical Reasoning test of the Differential Aptitude Tests, 
administered at end of first semester. The criterion was final, second-semester 
course grades. The correlation between test scores and criterion was .60. 

The first column of Table 3-6 shows the test scores, divided into four class in- 
tervals; the number of students whose scores fell into each interval is given in the 
second column, The remaining entries in each row of the table indicate the per- 
centage of cases within each test-score interval who received each grade at the 
end of the course. Thus, of the 22 students with scores of 30 or above on the Nu- 
merical Reasoning test, 5% received grades of D or below, none received C, 36% 
received B, and 59% A. At the other extreme, of the 14 students scoring below 10 
on the test, 43% received grades of D or below, 36% C, and 14% B. The anom- 
alous 7% receiving A, representing only one case, is virtually useless information 
for purposes of generalizability, as is the 5% with test-score of 30 or above who re- 
ceived a grade of D or lower, again represented by one case. Nevertheless, within 
the limitations of the available data, the percentages in Table 3-6 provide esti- 
mates of the probability that an individual will receive a given criterion grade. 
For example, if a new student receives a DAT Numerical Reasoning score of 24 
(i.e., in the 20-29 interval), we would conclude that the probability of her or his 
obtaining a course grade of A is 27 out of 100; the probability of her or his obtain- 
ing a grade of B is 43 out of 100, and so on. 

In many practical situations, criteria can be dichotomized into "success" and 
"failure" in a job, a course of study, or other undertaking. Under these conditions, 
an expectancy chart can be prepared, showing the probability of success or failure 
corresponding to each score interval. Figure 3—7 is an example of such an ex- 
pectancy chart. Based on a pilot selection battery developed by the U.S. Air 
Force, this expectancy chart shows the percentage of pilot cadets scoring within 
each stanine on the battery who failed to complete primary flight training. It can 
be seen that 7796 of the cadets receiving a stanine of 1 were eliminated in the 
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Pilot No. of © : pu 
Stanine Men 
9 21474 . 
8 19,444 
7 32,129 
6 39,398 
5 34,975 


4 23,699 . 


8 11209 


Figure 3-7. Expectancy Chart Showing Relation between Performance on Pilot 
Selection Battery and Elimination from Primary Flight Training. 


(From J. C. Flanagan, 1947, p. 58:) 


course of training, while only 496 of those at stanine 9 failed to complete the 
training satisfactorily. Between these extremes, the percentage of failures de- 
creases consistently over the successive stanines. On the basis of this expectancy 
chart, it could be predicted, for example, that approximately 40% of pilot cadets 
who obtain a stanine score of 4 will fail and approximately 6096 will satisfactorily 
complete primary flight training. Similar statements regarding the probability of 
success and failure could be made about individuals who receive each stanine. 
Thus, an individual with a stanine score of 4 has a 60:40 or 3:2 chance of com- 
pleting primary flight training. Besides providing a criterion-referenced interpre- 
tation of test scores, it can be seen that both expectancy tables and expectancy 
charts give a general idea of the validity of a test in predicting a given criterion. 
For this reason, a more detailed discussion of empirical procedures for setting cut- 
off scores is given at the end of chapter 6, in a section on decision models for fair 
use of tests, Also included in that section are references to mathematical proce- 
dures for setting optimal cutoff scores under various conditions. Specific applica- 
tions of cutoff scores in major areas of psychological practice may also be found in 
chapter 17. 


Reliability 


eliability” refers to the consistency of scores obtained by the same persons 

when they are reexamined with the same test on different occasions, or 

with different sets of equivalent items, or under other variable examining 

conditions. This concept of reliability underlies the computation of the error 

of measurement of a single score, whereby we can predict the range of fluctuation 

likely to occur in a single individual's score as a result of irrelevant or unknown 
chance factors. 

The concept of reliability has been used to cover several aspects of score con- 
sistency. In its broadest sense, test reliability indicates the extent to which indi- 
vidual differences in test scores are attributable to "true" differences in the 
characteristics under consideration and the extent to which they are attributable 
to chance errors. To put it in more technical terms, measures of test reliability 
make it possible to estimate what proportion of the total variance of test scores is 
error variance. These are not "errors" in the sense that they could be avoided or 
corrected through improved methodology. The error terminology has been inher- 
ited from an earlier era in psychology when interest centered on finding general 
laws of behavior and assessing individuals on what were assumed to be rigidly 
fixed underlying traits. Today, psychologists recognize variability as an intrinsic 
property of all behavior and seek to investigate and sort out the many sources of 
such variability. 

With regard to score reliability, the crux of the matter lies in the definition of 
error variance. Factors that might be considered error variance for one purpose 
would be classified under true variance for another. For example, if we are inter- 
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ested in measuring fluctuations of mood, then the day-by-day changes in scores 
on a test of cheerfulness-depression would be relevant to the purpose of the test 
and would hence be part of the true variance of the scores. If, on the other hand, 
the test is designed to measure more permanent personality characteristics, the 
same daily fluctuations would fall under the heading of error variance. 

Essentially, any condition that is irrelevant to the purpose of the test represents 
error variance. Thus, when examiners try to maintain uniform testing conditions 
by controlling the testing environment, instructions, time limits, rapport, and 
other similar factors, they are reducing error variance and making the test scores 
more reliable. Despite optimum testing conditions, however, no test is a perfectly 
reliable instrument. Hence, every test should be accompanied by a statement of 
its reliability. Such a measure of reliability characterizes the test when it is admin- 
istered under standard conditions and given to persons similar to those constitut- 
ing the normative sample. The characteristics of this sample should therefore be 
specified, together with the type of reliability that was measured. 

There could, of course, be as many varieties of test reliability as there are 
conditions affecting test scores, since any such conditions might be irrelevant 
for a certain purpose and would thus be classified as error variance. The types of 
reliability computed in actual practice, however, are relatively few. In this chap- 
ter, the principal techniques for measuring the reliability of test scores will be 
examined, together with the sources of error variance identified by each.! 

Since all types of reliability are concerned with the degree of consistency or 
agreement between two independently derived sets of scores, they can all be ex- 
pressed in terms of a correlation coefficient. Accordingly, the next section will con- 
sider some of the basic characteristics of correlation coefficients, in order to clarify 
their use and interpretation. More technical discussion of correlation, as well as 
more detailed specifications of computing procedures, can be found in any ele- 
mentary textbook of educational or psychological statistics, such as Runyon and 
Haber (1991) or D. C. Howell (1997). 


THE CORRELATION COEFFICIENT 


Meaning of Correlation. Essentially, a correlation coefficient (r) expresses 
the degree of correspondence, or relationship, between two sets of scores. Thus, if 
the top-scoring individual in variable 1 also obtains the top score in variable 2, the 
second-best individual in variable 1 is second best in variable 2, and so on down to 
the poorest individual in the group, then there would be a perfect correlation 
between variables 1 and 2. Such a correlation would have a value of +1.00. 


VThis approach to score reliability has sometimes been called a generalizability theory of reliability (see Bren- 
nan, 1994; Crick & Brennan, 1982; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Feldt & Brennan, 1989; 
Shavelson & Webb, 1991). This is not a sufficiently specific designation, however, because generalizability ap- 
plies to all aspects of a test score and, in fact, to all scientific data. A more precise description of this reliability 
procedure is based on its identification of variance components as relevant or irrelevant. 
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A hypothetical illustration of a perfect positive correlation is shown in Figure 
4-1. This figure presents 3 scatter diagram, or bivariate distribution. Each tally 
mark in this diagram indicates the score of one person in both variable 1 (hori- 
zontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases 
in the group are distributed along the diagonal running from the lower left- to the 
upper right-hand corner of the diagram. Such a distribution indicates a perfect 
positive correlation (+1.00), since it shows that each person occupies the same 
relative position in both variables. The closer the bivariate distribution of scores 
approaches this diagonal, the higher will be the positive correlation. 

Figure 4-2 illustrates a perfect negative correlation (—1.00). In this case, there 
is acomplete reversal of scores from one variable to the other. The best individual 
in variable 1 is the poorest in variable 2 and vice versa, this reversal being consis- 
tently maintained throughout the distribution. It will be noted that, in this scat- 
ter diagram, all persons fall on the diagonal extending from the upper left- to the 
lower right-hand corner. This diagonal runs in the reverse direction from that in 
Figure 4-1. 

A zero correlation indicates complete absence of relationship, such as might 
occur by chance. If each person’s name were pulled at random out of a box to de- 


Score on Variable 2 — 


Figure 4-1. Bivariate Distribution for a Hypothetical Correlation of +1.00. 
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Score on Variable 2 


30-39 


20-29 


10-19 


Figure 4-2. Bivariate Distribution for a Hypothetical Correlation of ^1.00. 


termine her or his position in variable 1, and if the process were repeated for vari- 
able 2, a zero or near-zero correlation would result. Under these conditions, it 
would be impossible to predict an individual's relative standing in variable 2 from 
a knowledge of her or his score in variable 1. The top-scoring person in variable 1 
might score high, low, or average in variable 2. By chance, some persons might 
score above average in both variables, or below average in both; others might fall 
above average in one variable and below in the other; still others might be above 
the average in one and at the average in the second, and so forth. There would be 
no regularity in the relationship from one individual to another. 

The coefficients found in actual practice generally fall between these ex- 
tremes, having some value higher than zero but lower than 1.00. Correlations 
between measures of abilities are nearly always positive, although frequently low. 
When a negative correlation is obtained between two such variables, it usually 
results from the way in which the scores are expressed. For example, if time 
scores are correlated with amount scores, a negative correlation will probably re- 
sult. Thus, if each person’s score on an arithmetic computation test is recorded as 
the number of minutes required to complete all items, while the score on an 
arithmetic reasoning test represents the number of problems correctly solved, a 
negative correlation can be expected. In such a case, the poorest (i.e., slowest) 
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individual will have the numerically highest score on the first test, while the best 
individual will have the highest score on the second. 

Correlation coefficients may be computed in various ways, depending on the 
nature of the data. The most common is the Pearson Product-Moment Correlation 
Coefficient. This correlation coefficient takes into account not only the person’s 
position in the group but also the amount of her or his deviation above or below 
the group mean. It will be recalled that when each person’s standing is expressed 
in terms of standard scores, persons falling above the average receive positive 
standard scores, while those below the average receive negative scores. Thus, an 
individual who is superior in both variables to be correlated would have two posi- 
tive standard scores; one inferior in both would have two negative standard 
scores. If, now, we multiply each individual’s standard score in variable 1 by her or 
his standard score in variable 2, all of these products will be positive, provided 
that each person falls on the same side of the mean on both variables. The Pear- 
son correlation coefficient is simply the mean of these products. It will have a 
high positive value when corresponding standard scores are of equal sign and of 
approximately equal amount in the two variables. When persons above the aver- 
age in one variable are below the average in the other, the corresponding cross- 
products will be negative. If the sum of the cross-products is negative, the 
correlation will be negative. When some products are positive and some negative, 
the correlation will be close to zero. 

In actual practice, it is not necessary to convert each raw score to a standard 
score before finding the cross-products, since this conversion can be made once 
for all after the cross-products have been added. There are many shortcuts for 
computing the Pearson correlation. coefficient. The method demonstrated in 
Table 4-1 is not the quickest, butt illustrates the meaning of the correlation co- 
efficient more clearly than other methods that utilize computational shortcuts. 
Table 4-1 shows the computation of a Pearson r between the mathematics and 
reading scores of 10 children. Next to each child's name are her or his scores in 
the mathematics test (X) and the reading test (Y). The sums and means of the 10 
scores are given under the respective columns. The third column shows the devi- 
ation (x) of each mathematics score from the mathematics mean; and the fourth 
column shows the deviation (y) of each reading score from the reading mean. 
These deviations are squared in the next two columns, and the sums of the 
squares are used in computing the standard deviations of the mathematics and 
reading scores by the method described in chapter 3. Rather than dividing each x 
and y by its corresponding SD to find standard scores, we perform this division 
only once at the end, as shown in the correlation formula in Table 4-1. The cross- 
products in the last column (xy) have been found by multiplying the correspond- 
ing deviations in the x and y columns. To compute the correlation (7), the sum of 
these cross-products is divided by the number of cases (N) and by the product of 
the two standard deviations (SD,SD,). 


Statistical Significance. The correlation of .40 found in Table 4—1 indicates 
a moderate degree of positive relationship between the mathematics and reading 
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Computation of Pearson Product-Moment 
Correlation Coefficient 


Pupil X Y x y x y xy 


scores. There is some tendency for those children doing well in mathematics also 
to perform well on the reading test and vice versa, although the relation is not 
close. If we are concerned only with the performance of these 10 children, we can 
accept this correlation as an adequate. description of the degree of relation 
existing between the two variables in this group. In psychological research, 
however, we are usually interested in generalizing rici of 
individuals tested to the larger population they represent. For example, we might 
want to know whether mathematics and reading ability are correlated among 
American schoolchildren of the same age as those we tested. Obviously, the 10 
cases actually examined would constitute a very inadequate sample of such a 
population. Another comparable sample of the same size might yield a much 
lower or a much higher correlation. ‘ i i um 
There are statistical procedures for estimating the probable fluctuation to be 
expected from sample to sample in the size of correlations, means, standard devi- 
ations, and any other group measures. The question usually asked about correla- 
tions, however, is simply whether the correlation is significantly greater than 
zero. In other words, if the correlation in the population is zero, could a correla- 
tion as high as that o! ined in our sample have resulted from sampling error 
alone? When we say that a correlation is “significant at the 1% (.01) level,” we 
mean the chances are no greater than one out of 100 that the population corre- 
lation is zero. Hence, we conclude that the two variables are truly corre ted. 
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Significance levels refer to the risk of error we are willing to take in drawing con- 
clusions from our data. If a correlation is said to be significant at the .05 level, 
the probability of error is 5 out of 100. Most psychological research applies cither 
the .01 or the .05 levels, although other significance levels may be employed for 
special reasons. 

The correlation of .40 found in Table 4-1 fails to reach significance even at the 
-05 level. As might have been anticipated, with only 10 cases it is difficult to es- 
tablish a general relationship conclusively. With this size of sample, the smallest 
correlation significant at the .05 level is .63. Any correlation below that value 
simply leaves unanswered the question of whether the two variables are corre- 
lated in the population from which the sample was drawn. The minimum correla- 
tions significant at the .01 and .05 levels for groups of different sizes can be found 
by consulting tables of the significance of correlations in any statistics textbook. 
For interpretive purposes in this book, however, only an understanding of the 
general concept is required. 


Ben vue. a. 
"One of the subtests of the SRA Tests of Primary M. T? 8 
an investigation by Anastasi and Drake (1954). lental Abilities for Ages 11 to 17. The data were obtained in 
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Figure 4-3. A Reliability Coefficient of .72. 
(Data from Anastasi & Drake, 1954.) 


coefficients, which usually fall in the .80s or .90s. An examination of the scatter 
diagram in Figure 4— shows a typical bivariate distribution of scores correspond- 
ing to a high positive correlation. It will be noted that the tallies cluster close to 
the diagonal extending from the lower left- to the upper right-hand corner; the 
trend is definitely in this direction, although there is a certain amount of scatter 
of individual entries. In the following section, the use of the correlation coeffi- 
cient in computing different measures of test reliability will be considered. 


TYPES OF RELIABILITY 


Test-Retest Reliability. The most obvious method for finding the reliability 
of test scores is by repeating the identical test on a second occasion. The reliability 
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coefficient (r,) in this case Wee the correlation between the scores obtained 
by the same persons on the two of the test. The error variance 
corresponds to the random fluctuations of performance from one test session to 
the other. These variations may result in part from uncontrolled testing 
conditions, such as extreme changes in weather, sudden noises and other 
distractions, or a broken pencil point, To some extent, however, they arise from 
condition of the test takers themselves, as illustrated by illness, 
fatigue, emotional strain, worry, recent experiences of a t or unpleasant 
nature, and the like. Retest reliability shows the extent to which scores on a test 
E EE EEN d ie the less 
are 


scores are to the random daily changes in the condition of the test 
takers or of the testing environment. 
When retest reliability is reported in à test manual, the interval over which it 


was measured should be specified. Since retest correlations decrease progres- 
sry hs eval thee B fot one but an infinite number of reren re. 
coefficients for any test. It is also desirable to give some indication ot 
relevant intervening experiences of the persons on whom reliability was measured, 
Riedchen ie sheet 
tte aber? readily he’ ged of tst: 
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yean. Many preschool intelligence for example, 
yield moderately stable measures within the aa Tine Poema 
useless as predictors of late childhood or adult IQs. [n actual practice, however, a 
simple distinction can usually be made. Shore-range, random fluctuations that 
occur during intervals ranging from a few hours to a few months are generally in 
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ever, should not be confused with that of the reliability of a test. When 
we measure the reliability of the Stanford-Binet, for , we do not ondfinar- 


ily correlate retest scores over a period of tet years, or even one but over à 
few weeks. To be sure, long-range retests have been conducted such tests, but 
the results are generally discussed in terms of the predictability of adult intelli- 
gence from childhood performance, rather than in terms of the reliability of a par- 
ticular test. The concept of reliability is generally restricted to short-range, 
random changes that characterize the test performance itself rather than the en: 
tire behavior domain that is being tested. 

1t should be noted that different behavior functions may themselves vary in 
the extent of daily fluctuation they exhibit. For example, steadiness of delicate 
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The concept of item sampling, or content sampling, underlies not only alter- 
nate-form reliability but also other types of reliability to be discussed shortly. It is 
therefore appropriate to examine it more closely. Most students have probably 
had the experience of taking a course examination in which they felt they had a 
“lucky break" because many of the items covered the very topics they happened 
to have studied most carefully. On another occasion, they may have had the op- 
posite experience, finding an unusually large number of items on areas they had 
failed to review. This familiar situation illustrates error variance resulting from 
content sampling. To what extent do scores on this test depend on factors specifi 
to the particular selection of items? If a different investigator, working indepen- 
dently, were to prepare another test in accordance with the same specifications, 
how much would an individual’s score differ on the two tests? 

Let us suppose that a 40-item vocabulary test has been constructed as a mea- 
sure of general verbal comprehension. Now suppose that a second list of 40 differ- 
ent words is assembled for the same purpose, and that the items are constructed 
with equal care to cover the same range of difficulty as the first test. The differ- 
ences in the scores obtained by the same individuals on these two tests illustrate 
the type of error variance under consideration. Owing to fortuitous factors in the 
past experience of different individuals, the relative difficulty of the two lists will 
vary somewhat from person to person. Thus, the first list might contain a larger 
number of words unfamiliar to individual A than does the second list. The second 
list, on the other hand, might contain a disproportionately large number of words 
unfamiliar to individual B. If the two individuals are approximately equal in their 
overall word knowledge (i.e, in their “true scores”), B will nevertheless excel A 
on the first list, while A will excel B on the second. The relative standing of these 
two persons will therefore be reversed on the two lists, owing to chance differ- 
ences in the selection of items. 

_ Like test-retest reliability, alternate-form reliability should always be accompa- 
nied by a statement of the length of the interval between test administrations, as 
well asa description of relevant intervening experiences. If the two forms are ad- 
ministered in immediate succession, the resulting correlation shows reliability 
across forms only, not across occasions. The error variance in this case represents 
fluctuations in performance from one set of items to another, but not fluctuations 
over time. 
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perimental factor on test performance. The use of several alternate forms also pro- 
vides a means of reducing the possibility of coaching or cheating. 

Although much more widely applicable than test-retest reliability, alternate- 
form reliability also has certain limitations. In the first place, if the behavior func- 
tions under consideration are subject to a large practice effect, the use of alternate 
forms will reduce but not eliminate such an effect. To be sure, if all test takers 
were to show the same improvement with repetition, the correlation between 
their scores would remain unaffected, since adding a constant amount to each 
score does not alter the correlation coefficient. It is much more likely, however, 
that individuals will differ in amount of improvement, owing to extent of previ- 
ous practice with similar material, motivation in taking the test, and other fac- 
tors. Under these conditions, the practice effect represents another source of 
variance that will tend to reduce the correlation between the two test forms. If 
the practice effect is small, reduction will be negligible. 

Another related question concerns the degree to which the nature of the test 
will change with repetition. In certain types of ingenuity problems, for example, 
any item involving the same principle can be readily solved by most persons once 
they have worked out the solution to the first. In such a case, changing the spe- 
cific content of the items in the second form would not suffice to eliminate this 
carry-over from the first form. Finally, it should be added that alternate forms are 
unavailable for many tests, because of the practical difficulties of constructing 
truly equivalent forms. For all these reasons, other techniques for estimating test 
reliability are often required. 


Split-Half Reliability. From a single administration of one form of a test it is 
possible to arrive at a measure of reliability by various split-half procedures. In 
such a way, two scores are obtained for each person by dividing the test into 
equivalent halves. It is apparent that split-half reliability provides a measure of 
consistency with regard to content sampling. Temporal stability of the scores does 
not enter into such reliability, because only one test session is involved. This type 
of reliability coefficient is sometimes called a coefficient of internal consistency, 
since only a single administration of a single form is required. 

To find split-half reliability, the first problem is how to split the test in order to 
obtain the most nearly equivalent halves. Any test can be divided in many differ- 
ent ways. In most tests, the first half and the second half would not be equivalent, 
owing to differences in nature and difficulty level of items, as well as to the cumu- 
lative effects of warming up, practice, fatigue, boredom, and any other factors 
varying progressively from the beginning to the end of the test. A procedure that 
is adequate for most purposes is to find the scores on the odd and even items of. 
the test. If the items were originally arranged in an approximate order of difficulty, 
such a division yields very nearly equivalent half-scores. One precaution to be ob- 
served in making such an odd-even split pertains to groups of items dealing with a 
single problem, such as questions referring to a particular mechanical diagram or 
to a given passage in a reading test. In this case, a whole group of items should be 
assigned intact to one or the other half. Were the items in such a group to be 
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placed in different halves of the test, the similarity of the half-scores would be 
spuriously inflated, since any single error in understanding the problem might af- 
fect items in both halves.? 

Once the two half-scores have been obtained for each person, they may be cor- 
related by the usual method. It should be noted, however, that this correlation ac- 
tually gives the reliability of only a half-test. For example, if the entire test 
consists of 100 items, the correlation is computed between two sets of scores each 
of which is based on only 50 items. In both test-retest and alternate-form reliabil- 
ity, on the other hand, each score is based on the full number of items in the test. 

Other things being equal, the longer a test, the more reliable it will be 3 It is 
reasonable to expect that, with a larger sample of behavior, we can arrive at a 
more adequate and consistent measure. The effect that lengthening or shortening 
a test will have on its coefficient can be estimated by means of the Spearman- 
Brown formula, given below: 
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in which r,, is the estimated coefficient, rẹ is the obtained coefficient, and n is the 
number of times the test is lengthened or shortened. Thus, if the number of test 
items is increased from 25 to 100, n is 4; if it is decreased from 60 to 30, n is 4. 
The Spearman-Brown formula is widely used in determining reliability by the 
split-half method, many test manuals reporting reliability in this form. When ap- 
plied to split-half reliability, the formula always involves doubling the length of 
the test. Under these conditions, it can be simplified as follows: 
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in which ru is the correlation of the half-tests. 

An alternate method for finding split-half reliability was developed by Rulon 
(1939). It requires only the variance of the differences between each person's 
scores on the two half-tests (SD3) and the variance of total scores (SD2); these 


two values are substituted in the followi ich yi iabili 
Kr ae in the following formula, which yields the teliability of 
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It is interesting to note the relationship of this formula to the definition of error 
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‘Lengthening a test, hi ill i 
» however, will increase only i i i : e zs 
Pres pi ies edis gech e D consistency in terms of content sampling, not its stability 


Reliability 111 


variance. Any difference between a person's scores on the two half-tests repre- 
sents irrelevant or error variance. The variance of these differences, divided by 
the variance of total scores, gives the proportion of error variance in the scores. 
When this error variance is subtracted from 1.00, it gives the proportion of “true” 
variance for a specified test use, which is equal to the reliability coefficient. 


Kuder-Richardson Reliability and Coefficient Alpha. A fourth method 
for finding reliability, also utilizing a single administration of a single form, is 
based on the consistency of responses to all items in the test. This interitem 
consistency is influenced by two sources of error variance: (1) content sampling (as 
in alternate-form and split-half reliability); and (2) heterogeneity of the behavior 
domain sampled. The more homogeneous the domain, the higher the interitem 
consistency. For example, if one test includes only multiplication items while 
another comprises addition, subtraction, multiplication, and division items, the 
former test will probably show more interitem consistency than the latter. In the 
latter, more heterogeneous test, one test taker may perform better in subtraction 
than in any of the other arithmetic operations; another may score relatively well 
on the division items, but less well in addition, subtraction, and multiplication; 
and so on. A more extreme example would be represented by a test consisting of 
40 vocabulary, items, in contrast to one containing 10 vocabulary, 10 spatial 
telations, 10 arithmetic reasoning, and 10 perceptual speed items. In the latter 
test, there might be little or no relationship between an individual’s performance 
on the different types of items. 

It is apparent that test scores will be less ambiguous when derived from rela- 
tively homogeneous tests. Suppose that in the highly heterogeneous 40-item test 
cited previously, Smith and Jones both obtain a score of 20. Can we conclude that 
the performances of the two on this test were equal? Not at all. Smith may have 
correctly completed 10 vocabulary items, 10 perceptual speed items, and none of 
the arithmetic reasoning and spatial relations items. In contrast, Jones may have 
teceived a score of 20 by the successful completion of 5 perceptual speed, 5 spatial 
relations, 10 arithmetic reasoning, and no vocabulary items. 

Many other combinations could obviously produce the same total score of 20. 
This score would have a very different meaning when obtained through such dis- 
similar combinations of items. In the relatively homogeneous vocabulary test, on 
the other hand, a score of 20 would probably mean that the test taker had suc- 
ceeded with approximately the first 20 words, if the items were arranged in as- 
cending order of difficulty. He or she might have failed two or three easier words 
and correctly responded to two or three more difficult items beyond the 20th, but 
such individual variations are slight in comparison with those found In a more 
heterogeneous test. 

A highly relevant question in this connection is whether the criterion that the 
test is trying to predict is itself relatively homogeneous or heterogeneous. Although 
homogeneous tests are to be preferred because their scores permit fairly unambigu- 
ous interpretation, a single homogeneous test is obviously not an adequate predictor 
of a highly heterogeneous criterion. Moreover, in the prediction of a heterogeneous 
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criterion, the heterogeneity of test items would not necessarily represent error vari- 
ance. Traditional intelligence tests provide a good example of heterogeneous tests 
designed to predict heterogeneous criteria. In such a case, however, it may be desir- 
able to construct several relatively homogeneous tests, each measuring a different 
phase of the heterogeneous criterion. Thus, unambiguous interpretation of test 
scores could be combined with adequate criterion coverage. 

The most common procedure for finding interitem consistency is that devel- 
oped by Kuder and Richardson (1937). As in the split-half methods, interitem 
consistency is found from a single administration of a single test. Rather than re- 
quiring two half-scores, however, this technique is based on an examination of 
performance on each item. Of the various formulas derived in the original article, 
the most widely applicable, commonly known as “Kuder-Richardson formula 20,” 
is the following: 
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In this formula, ry is the reliability coefficient of the whole test, n is the number of 
items in the test, and SD, is the standard deviation of total scores on the test. The 
only new term in this formula, Xpq, is found by tabulating the proportion of per- 
sons who pass (p) and the proportion who do not pass (q) each item. The product 
of p and q is computed for each item, and these products are then added for all 
items, to give Xpq. Since in the process of test construction b is often routinely 
recorded in order to find the difficulty level of each item, this method of deter- 
mining reliability involves little additional computation. 

It can be shown mathematically that the Kuder-Richardson reliability coeffi- 
cient is actually the mean of all split-half coefficients resulting from different 
splittings of a test (Cronbach, 1951).? The ordinary split-half coefficient, on the 
other hand, is based on a planned split designed to yield equivalent sets of items. 
Hence, unless the test items are highly homogeneous, the Kuder-Richardson co- 
efficient will be lower than the split-half reliability. An extreme example will 
serve to highlight the difference. Suppose we construct a 50-item test out of 25 
different kinds of items such that items 1 and 2 are vocabulary items, items 3 and 
4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and 
even scores on this test could theoretically agree quite closely, thus yielding a 
high split-half reliability coefficient. The homogeneity of this test, however, 
would be very low, since there would be little consistency of performance among 
the entire set of 50 items. In this example, we would expect the Kuder-Richard- 
son reliability to be much lower than the split-half reliability. In fact, the differ- 
ence between Kuder-Richardson and split-half reliability coefficients may serve as 
a rough index of the heterogeneity of a test. 
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i The Kuder-Richardson formula is applicable to tests whose items are scored as 
right or wrong, or according to some other all-or-none system. Some tests, how- 
ever, may have multiple-scored items. On a personality inventory, for example, 
the respondent may receive a different numerical score on an item, depending on 
whether he or she checks “usually,” “sometimes,” “rarely,” or “never.” For such 
tests, a generalized formula has been derived, known as coefficient alpha (Cron- 
bach, 1951; Kaiser & Michael, 1975; Novick & Lewis, 1967). In this formula, the 
value Xpq is replaced by X(SD?), the sum of the variances of item scores. The pro- 
cedure is to find the variance of all individuals’ scores for each item and then to 
add these variances across all items. The complete formula for coefficient alpha is 
given as follows: 


«- (2; aeo 
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Scorer Reliability. It should now be apparent that the different types of relia- 
bility vary in the factors they subsume under error variance. In one case, error 
variance covers temporal fluctuations; in another, it refers to differences between 
sets of parallel items; and in still another, it includes any interitem inconsistency. 
On the other hand, the factors excluded from measures of error variance are 
broadly of two types: (a) those factors whose variance should remain in the scores, 
since they are part of the true differences under consideration; and (b) those 
irrelevant factors that can be experimentally controlled. For example, it is not 
customary to report the error of measurement resulting when a test is 
administered under distracting conditions or with a longer or shorter time limit 
than that specified in the manual. Timing errors and serious distractions can be 
empirically eliminated from the testing situation. Hence, it is not necessary to 
report special reliability coefficients corresponding to “distraction variance” or 
“timing variance.” 

Similarly, most tests provide such highly standardized procedures for adminis- 
tration and scoring that error variance attributable to these factors is negligible. 
This is particularly true of group tests designed for mass testing and computer 
scoring. With such instruments. we need only to make certain that the prescribed 
procedures are carefully followed and adequately checked. With clinical instru- 
ments employed in intensive individual examinations, on the other hand, there is 
evidence of considerable examiner variance. Through special experimental de- 
signs, it is possible to separate this variance from that attributable to temporal 
fluctuations in the test taker’s condition or to the use of alternate test forms. 

One source of error variance that can be checked quite simply is scorer variance. 
Certain types of tests—notably tests of creativity and projective tests of personal- 
ity—leave a good deal to the judgment of the scorer. With such tests, there is as 
much need for a measure of scorer reliability as there is for the more usual reliabil- 
ity coefficients. Scorer reliability can be found by having a sample of test papers in- 
dependently scored by two examiners. The two scores thus obtained by each test 
taker are then correlated in the usual way, and the resulting correlation coefficient 
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is a measure of scorer reliability. This type of reliability is commonly computed 
when subjectively scored instruments are employed in research. Test manuals 
should also report it when appropriate. 


Overview. The different types of reliability coefficients discussed in this sec- 
tion are summarized in Tables 4-2 and 4-3. In Table 4-2, the operations followed 
in obtaining each type of reliability are classified with regard to number of test 
forms and number of testing sessions required. Table 4-3 shows the sources of 
variance treated as error variance by each procedure. 

Any reliability coefficient may be interpreted directly in terms of the percentage 
of score variance attributable to different sources. Thus, a reliability coefficient of 
.85 signifies that 85% of the variance in test scores depends on true variance in 
the trait measured, and 1596 depends on error variance (as operationally defined 
by the specific procedure followed). The statistically sophisticated reader may re- 
call that it is the square of a correlation coefficient that represents proportion of 
common variance. Actually, the proportion of true variance in test scores is the 
square of the correlation between scores on a single form.of the test and true 
scores, which are free from chance errors. This correlation, known as the index of 
reliability is equal to the square root of the reliability coefficient (Vr). When 
the index of reliability is itself squared, the result is the original reliability coeffi- 
cient (r4), which can therefore be interpreted directly as the percentage of true 
variance for a designated test use. 

Experimental designs that yield more than one type of reliability coefficient for 
the same group permit the analysis of total score variance into different compo- 
nents. Let us consider the following hypothetical example. Forms A and B of a, 
creativity test have been administered with a two-month interval to 100 sixth- 
grade children. The resulting alternate-form reliability is .70. From the responses 
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Sources of Error Variance in Relation to 
Reliability Coefficients 


Type of Reliability Coefficient Error Variance 

Test-Retest Time sampling 

Alternate-Form (Immediate) = ~ Content sampling 

Alternate-Form (Delayed) Time sampling and content sampling 
Split-Half i ; ‘Content sampling |. ticis REED 
Kuder-Richardson and Coefficient Alpha = Content sampling and content heterogeneity 
Scorer ds Interscorer differences à 


on either form, a split-half reliability coefficient can also be computed. This co- 
efficient, stepped up by the Spearman-Brown formula, is .80. Finally, a second 
scorer has rescored a random sample of 50 papers, from which a scorer reliability 
of .92 is obtained. The three reliability coefficients can now be analyzed to yield 
the error variances shown in Table 4—4 and Figure 4—4. It will be noted that by 
subtracting the error variance attributable to content sampling alone (split-half 
reliability) from the error variance attributable to both content and time sam- 
pling (alternate-form reliability), we find that .10 of the variance can be attrib- 
uted to time sampling alone. Adding the error variances attributable to content 
sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total 
error variance of 28 and hence a true variance of .62. These proportions, ex- 
pressed in the more familiar percentage terms, are shown graphically in Figure 
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"Fora better estimate of the coefficient of internal consistency, split-half correlations could be computed for each 
form and the two coefficients averaged by the appropriate statistical procedures (e.g., Fisher z-transformation). 
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4-4. This sorting out of sources of variance is the essence of the so-called general- 
izability theory of reliability. Complex experimental designs permitting simulta- 
neous assessment of more sources of score variance and the interactions among 
them can be found in detailed topical treatments, such as Brennan (1984), Cron- 
bach et al. (1972), Feldt and Brennan (1989), and Shavelson and Webb (1991). 


RELIABILITY OF SPEEDED TESTS 


Both in test construction and in the interpretation of test scores, an important 
distinction is that between the measurement of speed and of power. A pure speed 
test is one in which individual differences depend entirely on speed of perfor- 
mance. Such a test is constructed from items of uniformly low difficulty, all of 
which are well within the ability level of the persons for whom the test is de- 
signed. The time limit is made so short that no one can finish all the items. Under 
these conditions, each person’s score reflects only the speed with which he or she 
worked. A pure power test, on the other hand, has a time limit long enough to per- 
mit everyone to attempt all items. The difficulty of the items is steeply graded, 
and the test includes some items too difficult for anyone to solve, so that no one 
can get a perfect score. —— 

It will be noted that both speed and power tests are designed to prevent the 
achievement of perfect scores. The reason for such a precaution is that perfect 
scores are indeterminate, since it is impossible to know how much higher the in- 
dividual’s score would have been if more items, or more difficult items, had been 
included. To enable each individual to show fully what he or she is able to accom- 
plish, the test must provide adequate ceiling, either in number of items or in diffi- 
culty level. An exception to this rule is found in mastery testing, as illustrated by 
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Figure 4-4. Percentage Distribution of Score Variance in a Hypothetical Test. 
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the domain-referenced tests discussed in chapter 3. The purpose of such testing is 
not to establish the limits of what the individual can do but to determine whether 
a preestablished performance level has or has not been reached. 

In actual practice, the distinction between speed and power tests is one of de- 
gree, most tests depending on both power and speed in varying proportions. Infor- 
mation about these proportions is needed for each test in order not only to 
understand what the test measures but also to choose the proper procedures for 
evaluating its reliability. Single-trial reliability coefficients, such as those found by 
odd-even or Kuder-Richardson techniques, are inapplicable to speeded tests. To 
the extent that individual differences in test scores depend on speed of perfor- 
mance, reliability coefficients found by these methods will be spuriously high. An 
extreme example will help to clarify this point. Let us suppose that a 50-item test 
depends entirely on speed, so that individual differences in score are based wholly 
on number of items attempted, rather than on errors. Then, if individual A ob- 
tains a score of 44, he or she will obviously have 22 correct odd items and 22 cor- 
rect even items. Similarly, individual B, with a score of 34, will have odd and 
even scores of 17 and 17, respectively. Consequently, except for accidental care- 
less errors on a few items, the correlation between odd and even scores would be 
perfect, or +1.00. Such a correlation, however, is entirely spurious and provides 
no information about the reliability of the test. 

An examination of the procedures followed in finding both split-half and 
Kuder-Richardson reliability will show that both are based on the consistency in 
number of errors made by the examinee. If, now, individual differences in test 
scores depend, not on errors, but on speed, the measure of reliability must obvi- 
ously be based on consistency in speed of work. When test performance depends 
on a combination of speed and power, the single-trial reliability coefficient will 
fall below 1.00, but it will still be spuriously high. As long as individual differ- 
ences in test scores are appreciably affected by speed, single-trial reliability coeffi- 
cients cannot be properly interpreted. 

What alternative procedures are available to determine the reliability of signif- 
icantly speeded tests? If the test-retest technique is applicable, it would be appro- 
priate. Similarly, equivalent-form reliability may be properly employed with speed 
tests. Split-half techniques may also be used, provided that the split is made in 
terms of time rather than in terms of items. In other words, the half-scores must 
be based on separately timed parts of the test. One way of effecting such a split is 
to administer two equivalent halves of the test with separate time limits. For ex- 
ample, the odd and even items may be separately printed on different pages, and 
each set of items given with one-half the time limit of the entire test. Such a pro- 
cedure is tantamount to administering two equivalent forms of the test in imme- 
diate succession. Each form, however, is half as long as the test proper, while the 
test takers’ scores are normally based on the whole test. For this reason, either the 
Spearman-Brown or some other appropriate formula should be used to find the re- 
liability of the whole test. 

If it is not feasible to administer the two half-tests separately, an alternative 
procedure is to divide the total time into quarters and to find a score for each of 
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the four quarters. This can easily be done by having the test takers mark the item 
on which they are working whenever the examiner gives a prearranged signal. 
The number of items correctly completed within the first and fourth quarters can 
then be combined to represent one half-score, while those in the second and 
third quarters can be combined to yield the other half-score. Such a combination 
of quarters tends to balance out the cumulative effects of practice, fatigue, and 
other factors. This method is especially satisfactory when the items are not 
steeply graded in difficulty level. 

When is a test appreciably speeded? Under what conditions must the special 
precautions discussed in this section be observed? Obviously, the mere employ- 
ment of a time limit does not signify a speed test. If all test takers finish within the 
given time limit, speed of work plays no part in determining the scores. Percent- 
age of persons who fail to complete the test might be taken as a crude index of 
speed versus power. Even when no one finishes the test, however, the role of 
speed may be negligible. For example, if everyone completes exactly 40 items of a 
50-item test, individual differences with tegard to speed are entirely absent, al- 
though no one had time to attempt all the items. 

The essential question, of course, is: “To what extent are individual differences 
in test scores attributable to speed?” In more technical terms, we want to know 
what proportion of the total variance in test scores is speed variance. This propor- 
tion can be estimated roughly by finding the variance of number of items com- 
pleted by different persons and dividing it by the variance of total test scores 
(SD2/SD?). In the example cited in the preceding paragraph, in which every indi- 
vidual finishes 40 items, the numerator of this fraction would be zero, since there 
are no individual differences in number of items completed (SD? = 0). The entire 
index would thus equal zero in a pure power test. On the other hand, if the total 
test variance (SD?) is attributable to individual differences in speed, the two vari- 
ances will be equal and the ratio will be 1.00. Several more refined procedures 
have been developed for determining this proportion, but their detailed consider- 
ation falls beyond the scope of this book. 

An example of the effect of speed on single-trial reliability coefficients is pro- 
vided by data collected in an investigation of the first edition of the SRA Tests of 
Primary Mental Abilities for Ages 11 to 17 (Anastasi & Drake, 1954). In this 
study, the reliability of each test was first determined by the usual odd-even proce- 
dure. These coefficients are given in the first row of Table 4—5. Reliability coeffi- 
cients were then computed by correlating scores on separately timed halves. 
These coefficients are shown in the second row of Table 4-5. Calculation of speed 
indexes showed that the Verbal Meaning test was primarily a power test, while 
the Reasoning test was somewhat more dependent on speed. The Space and 
Number tests proved to be highly speeded. It will be noted in Table 4-5 that, 
when it is properly computed, the reliability of the Space test is .75, in contrast to 
a spuriously high odd-even coefficient of .90. Similarly, the reliability of the Rea- 
soning test drops from .96 to .87, and that of the Number test drops from .92 to 
-83. The reliability of the relatively unspeeded Verbal Meaning test, on the other 
hand, shows a negligible difference when computed by the two methods. 
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Reliability Coefficients of Four of the SRA Tests of Primary 
Mental Abilities for Ages 11 to 17 (1st Edition) 


Reliability Coefficient Verbal 
Found by: Meaning Reasoning Space Number 
| Single-trial odd-even method Pa EAM Opt k EN 92 
| Separately timed halves ag E E 5 83 


(Data from Anastasi & Drake, 1954.) 


DEPENDENCE OF RELIABILITY 
COEFFICIENTS ON THE SAMPLE TESTED 


Variability. An important condition influencing the size of a reliability coeffi- 
cient is the nature of the group on which reliability is measured. In the first place, 
any correlation coefficient is affected by the range of individual differences in the 
group. If every member of a group were nearly alike in spelling ability, then the 
correlation of spelling with any other ability would be close to zero in that group. 
It would obviously be impossible, within such a group, to predict an individual’s 
standing in any other ability from a knowledge of her or his spelling score. 

Another, less extreme, example is provided by the correlation between two ap- 
titude tests, such as a verbal comprehension and an arithmetic reasoning test. If 
these tests were administered to a highly homogeneous sample, such as a group of 
300 college sophomores, the correlation between the two would probably be very 
low. Because of restriction of range, there is little relationship, within such a se- 
lected sample of college students, between any individual’s verbal ability and her 
or his numerical reasoning ability. On the other hand, were the tests to be given 
to a heterogeneous sample of 300 persons, ranging from mentally retarded persons 
to college graduates, a high correlation would undoubtedly be obtained between 
the two tests. The mentally retarded examinees would obtain poorer scores than 
the college graduates on both tests, and similar relationships would hold for other 
subgroups within this highly heterogeneous sample. 

Examination of the hypothetical scatter diagram given in Figure 4—5 will fur- 
ther illustrate the dependence of correlation coefficients on the variability, or ex- 
tent of individual differences; within the group. This scatter diagram shows a high 
positive correlation in the entire, heterogeneous group, since the entries are 
closely clustered about the diagonal extending from lower left- to upper right- 
hand corners. If, now, we consider only the subgroup falling within the small rec- 
tangle in the upper right-hand portion of the diagram, it is apparent that the 
correlation between the two variables is close to zero. Individuals falling within 
this restricted range in both variables represent a highly homogeneous group, as 
did the college sophomores mentioned previously. 
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Figure 4-5, The Effect of Restricted Range upon a Correlation Coefficient. 


Like all correlation coefficients, reliability coefficients depend on the variabil- 
ity of the sample within which they are found. Thus, if the reliability coefficient 
reported in a test manual was calculated for a group ranging from fourth-grade 
children to high school students, it cannot be assumed that the reliability would 
be equally high within, let us say, an eighth-grade sample. When a test is to be 
used to discriminate individual differences within a more homogeneous sample 
than the standardization group, the reliability coefficient should be redetermined 
on such a sample. Formulas for estimating the reliability coefficient to be ex- 
pected when the standard deviation of the group is increased or decreased are 
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available in elementary statistics textbooks. It is preferable, however, to recom- 
pute the reliability coefficient empirically on a group comparable to that on 
which the test is to be used. For tests designed to cover a wide range of age or abil- 
ity, the test manual should report separate reliability coefficients for relatively ho- 
mogeneous subgroups within the standardization sample. 


Ability Level. Not only does the reliability coefficient vary with the extent of 
individual differences in the sample, but it may also vary between groups differing 
in average ability level. These differences, moreover, cannot usually be predicted 
or estimated by any statistical formula but can be discovered only by empirical 
tryout of the test on groups differing in age or ability level. Such differences in the 
reliability of a single test may arise in part from the fact that a slightly different 
combination of abilities is measured at different difficulty levels of the test. Or the 
length of the test may vary at different age levels. Even when the number of 
available items is the same, the upper and lower extremes may not provide 
enough items at the appropriate difficulty level to permit individuals to dem- 
onstrate adequately what they are able to do (ceiling or floor effects). In other 
tests, reliability may be relatively low for the younger and less able groups, since 
their scores are unduly influenced by guessing. 

It is apparent that every reliability coefficient should be accompanied by a full 
description of the type of group on which it was determined. Special attention 
should be given to the variability and the ability level of the sample. The reported 
reliability coefficient is applicable only to samples similar to that on which it was 
computed. A desirable and growing practice in test construction is to fractionate 
the standardization sample into more homogeneous subgroups, with regard to age, 
sex, grade level, occupation, and the like, and to report separate reliability coeffi- 
cients for each subgroup. Under these conditions, the reliability coefficients are 
more likely to be applicable to the samples with which the test is to be used in ac- 
tual practice. 


STANDARD ERROR OF MEASUREMENT 


Interpretation of Individual Scores. The reliability of a test may be ex- 
pressed in terms of the standard error of measurement (SEM), also called the 
standard error of a score. This measure is particularly well suited to the in- 
terpretation of individual scores. For many testing purposes, it is therefore more 
useful than the reliability coefficient. The standard error of measurement can be 
easily computed from the reliability coefficient of the test, by the following 
formula: 


SEM = SD,V1 - Te 


in which SD, is the standard deviation of the test scores and rẹ is the reliability 
coefficient, both computed on the same group. For example, if deviation IQs on a 
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particular intelligence test have a standard deviation of 15 and a reliability coeffi- 
cient of .89, the SEM of an IQ on this test is: 15V1 —.89 = 15V.11 = 15(.33) =5. 

To understand what the SEM tells us about a score, let us suppose that we had 
a set of 100 IQs obtained with the above test bya single child, Janet. Because of 
the types of chance errors discussed in this chapter, these scores vary, falling into a 
normal distribution around Janet’s true score. The mean of this distribution of 
100 scores can be taken as the “true score” for a specified test use, and the stan- 
dard deviation of the distribution can be taken as the SEM. Like any standard de- 
viation, this standard error can be interpreted in terms of the normal curve 
frequencies discussed in chapter 3 (see Fig. 3-3). It will be recalled that between 
the mean and lo, there are approximately 68% of the cases in a normal curve. 
Thus, we can conclude that the chances are roughly 2:1 (or 68:32) that Janet's IQ 
on this test will fluctuate between +1 SEM or 5 points on either side of her truc 
IQ. If her true IQ is 110, we would expect her to score between 105 and 115 about 
two thirds (6896) of the time. 

If we want to be more certain of our prediction, we can choose higher odds 
than 2:1. Reference to Figure 3-3 (chap. 3) shows that +36 covers 99.7% of the 
cases. It can be ascertained from normal curve frequency tables that a distance of 
2.586 on either side of the mean includes exactly 99% of the cases. Hence, the 
chances are 99:1 that Janet's IQ will fall within 2.58 SEM, or (2.58) (5) = 13 
points, on either side of her true IQ. We can thus state at the 99% confidence 
level (with only one chance of error out of 100) that Janet’s IQ on any single ad- 
ministration of the test will lie between 97 and 123 (110 = 13 and 110 + 13). If 
Janet were given 100 equivalent tests, her 1Q would fall outside this band of val- 
ues only once. 

In actual practice, of course, we do not have the true scores, but only the scores 
obtained ina single test administration. Under these circumstances, we can apply 
the above reasoning in the reverse direction. If an individual’s obtained score is 
unlikely to deviate by more than 2.58 SEM from her true score, we could argue 
that her true score must lie within 2.58 SEM of her obtained score. Although we 
cannot assign a probability to this statement for any given obtained score, we can 
say that the statement would be correct for 99% of all the cases. On the basis of 
this reasoning Gulliksen (1950, pp- 17-20) proposed that the standard error of 
measurement be used as illustrated above, in order to estimate the reasonable lim- 
its of the true score for persons with any given obtained score. It is in terms of 
such “reasonable limits” that the error of measurement is customarily interpreted 
in psychological testing, and it will be so interpreted in this book.® 

The standard error of measurement and the teliability coefficient are obvi- 
ously alternative ways of expressing test reliability. Unlike the reliability coeffi- 
cient, the error of measurement is independent of the variability of the group 


EEN 
SOther procedures have been proposed that use an estimated "true" i 

à true" score as the center of the confidence interval 
(Dudek, 1979; Glutting, McDermott, & Stanley, 1987). If the reliability coefficient is high, this procedure has 
little effect; if it is low, both true score and size of confidence interval are computed from the same fallible relia- 
bility coefficient. Moreover, the optimal procedure varies with the particular purpose for which the test scores 
are to be used (e.g., for long-term prediction or current performance assessment). 


Reliability 123 


on which it is computed: Expressed in terms of individual scores, it remains un- 
changed when found in a homogeneous or a heterogeneous group. On the other 
hand, being reported in score units, the error of measurement will not be di- 
rectly comparable from test to test. The usual problems of comparability of units 
would thus arise when errors of measurement are reported in terms of arithmetic 
problems, words in a vocabulary test, and the like. Hence, if we want to com- 
pare the reliability of different tests, the reliability coefficient is the better mea- 
sure. To interpret individual scores, the standard error of measurement is more 
appropriate. 

Neither reliability coefficients nor errors of measurement, however, can be 
assumed to remain constant when ability level varies widely. The differences in 
reliability coefficients discussed in the preceding section remain when errors of 
measurement are computed at different levels of the same test. A comprehen- 
sive solution for this problem is provided by the IRT techniques of item analysis 
cited in chapter 3. Spanning a wide ability range, these techniques offer a 
means of expressing the measurement accuracy of a test as a function of the 
level of ability. The procedure yields a test information curve that depends only 
on the items included in the test and permits an estimation of the error of mea- 
surement at each ability level. Further discussion of these techniques is given in 
chapter 7. 

The SEM (or some other index of measurement accuracy) provides a safeguard 
against placing undue emphasis on a single numerical score. So important is this 
application of the SEM that an increasing number of published tests now report 
scores, not as a single number, but as a score band within which the individual’s 
true score is likely to fall. The College Board provides data on the SEM and an 
explanation of its use, not only in materials distributed to high school and college 
counselors, but also in the individual score reports on the SAT sent to test takers. 
The SEM is likewise covered in instructional materials for use in orienting stu- 
dents regarding the meaning of their test scores. Information on SEMs is also 
provided for use in interpreting scores on the Graduate Record Examinations 
(GRE 1995-96 guide). 


Interpretation of Score Differences. ltis particularly important to consid- 
er test reliability and errors of measurement when evaluating the differences 
between two scores. Thinking in terms of the range within which each score may 
fluctuate serves as a check against overemphasizing small differences between 
scores, Such caution is desirable both when comparing test scores of different 
persons and when comparing the scores of the same individual in different 
abilities. Similarly, changes in scores following instruction or other experimental 
variables need to be interpreted in the light of errors of measurement. 

A frequent question about test scores concerns the individual's relative stand- 
ing in different areas. Is Doris more able along verbal than along numerical lines? 
Does Tom have more aptitude for mechanical than for verbal activities? If Doris 
scored higher on the verbal than on the numerical subtests on an aptitude battery 
and Tom scored higher on the mechanical than on the verbal, how sure can we be 
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that they would still do so on a retest with another form of the battery? In othe: 
words, could the score differences have resulted merely from the chance selection 
of specific items in the particular verbal, numerical, and mechanical tests em- 
ployed? These questions are especially relevant to the proper interpretation of 
scores on multiscore batteries in both abilities and personality traits (Anastasi, 
19852). Examples and further discussion of the problems to be considered in in 

terpreting a person's score profile on such batteries can be found in chapters 8 and 
10 (for ability tests) and chapter 13 (for personality tests). 

Because of the growing interest in the interpretation of score profiles, test 
publishers have been developing report forms that permit the evaluation of 
scores in terms of their errors of measurement. An example is the individual re- 
port form for use with the Differential Aptitude Tests, which includes the sort of 
information illustrated in Figure 4—6. On this form, percentile scores on each 
subtest of the battery are plotted as percentile bands, with the obtained per- 
centile at the center. Each percentile bar corresponds to a distance of 1 SEM on 
either side of the obtained score—hence the probability that the individual's 
"true" score falls within the bar is approximately 2 to 1 (.68 to .32). In interpret- 
ing the profiles, test users are advised not to attach importance to differences be- 
tween scores whose percentile bars overlap, especially if they overlap by more 
than half their length. In the profile illustrated in Figure 4—6, for example, the 
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Figure 4-6. Score Profile on the Differential Aptitude Tests, Illustrating Use of 
Percentile Bands. 
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difference between the Verbal Reasoning and Numerical Reasoning scores prob- 
ably reflects a genuine difference in ability level; that between Numerical Rea- 
soning and Abstract Reasoning probably does not; the difference between 
Abstract Reasoning and Mechanical Reasoning is in the doubtful range. 

It is good to bear in mind that the standard error of the difference between two 
scores is larger than the error of measurement of either of the two scores. This fol- 
lows from the fact that this difference is affected by the chance errors present in 
both scores. The standard error of the difference between two scores can be found 
from the standard errors of measurement of the two scores by the following 
formula: 


SE gir, = V(SEMi)? + (SEM) 


in which SEj; is the standard error of the difference between the two scores, and 
SEM; and SEM; are standard errors of measurement of thé separate scores. 
By substituting SD V1 —71 for SEM; and SD V1 — 75; for SEM2, we may rewrite 
the formula directly in terms of reliability coefficients, as follows: 


Bag = SDV2—711 = 722 


In this substitution, the same SD was used for tests 1 and 2, since their scores 
would have to be expressed in terms of the same scale before they could be com- 
pared. 

We may illustrate the above procedure with the Verbal and Performance IQs 
on the Wechsler Adult Intelligence Scale-Revised (WAIS-R). The split-half reli- 
abilities of these scores are .97 and .93, respectively. WAIS-R deviation IQs are 
expressed on a scale with a mean of 100 and an SD of 15. Hence the standard 
error of the difference between these two scores can be found as follows: 


SEa = .15V2 —.97 -.93. AH 


To determine how large a score difference could be obtained by chance at the 05 
level, we multiply the standard error of the difference (4.74) by 1.96. The result is 
9.29, or approximately 10 points. Thus, the difference between an individual's 
WAIS-R Verbal and Performance IQ should be at least 10 points to be significant 
at the .05 level. 


9This formula should not be confused with the formula for the standard error of a difference between two group 
means, which includes a correlation term when the two variables to be compared are correlated. Errors of mea- 
surement in two variables are random errors and hence are assumed to be uncorrelated. 

10More precise estimates can be obtained by using the actual reliabilities and SDs found within each age group. 
When thus computed, the minimum significant Verbal-Performance difference at the .05 level, as reported in 


the test manual, ranges from 8.83 to 12.04. Most of the values, however, are close to 10. 
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RELIABILITY APPLIED 
TO MASTERY TESTING AND CUTOFF SCORES 


It will be recalled from chapter 3 that domain-referenced tests usually (but not 
necessarily) evaluate performance in terms of mastery rather than degree of 
achievement. A major statistical implication of mastery testing is a reduction in 
variability of scores among persons. Theoretically, if everyone continues training 
until the skill is mastered, variability is reduced to zero. In an earlier section of 
this chapter, we saw that any correlation, including reliability coefficients, is af- 
fected by the variability of the group in which it is computed. As the variability of 
the sample decreases, so does the correlation coefficient. Obviously, then, it 
would be inappropriate to assess the reliability of most domain-referenced tests by 
applying the usual procedures to a group of persons after they have reached the 
preestablished mastery level. Under these conditions, even a highly stable and in- 
ternally consistent test could yield a reliability coefficient near zero. 

This apparent difficulty in the assessment of reliability arises from a failure to 
consider what the domain-referenced tests are designed to measure. In actual 
practice, these tests are used essentially to differentiate between those persons 
who have and those who have not acquired the skills and knowledge required for 
a designated activity. The specific purpose for which the test is administered may 
vary widely, from obtaining a driver's license or assignment to an occupational 
specialty to advancing to the next unit in an individualized instructional program 
or admission to a particular course of study. Nevertheless, in all such situations, 
the fact that a test is used at all implies the expectation of variability in perfor- 
mance among individuals. A major portion of this variability reflects individual 
differences in amount of prior training in the relevant functions. 

More than a dozen different techniques have been specifically designed to 
evaluate the reliability of domain-referenced tests (Berk, 1984b; Brennan, 1984; 
Subkoviak, 1984). Some of these techniques are appropriate for simple mas- 
tery-nonmastery decisions, in which all classification errors are considered 
equally serious regardless of their distance from the cutoff score. In such cases, test 

` and retest with parallel forms can be used to find the percentage of persons for 
whom the same decision is reached on both occasions. These data can be further 
analyzed by computing appropriate indexes of agreement and significance values. 
Other procedures take into account the actual scores obtained on the two occa- 
sions and provide indexes that reflect each person's deviation above or below any 
given cutoff score. The choice of a particular procedure should take into account 
the nature and uses of the test, the position of cutoff scores, and other psychome- 
tric features of the test. Relevant considerations have been extensively discussed 
in the technical literature (see Berk, 19842; Feldt & Brennan, 1989). 


Validity: 
Basic Concepts 


he validity of a test concerns what the test measures and how well it does 
so. It tells us what can be inferred from test scores, In this connection, we 
should guard against accepting the test name as an index of what the test 
measures. Test names provide short, convenient labels for identification 
purposes. Most test names are far too broad and vague to furnish meaningful 
clues to the behavior area covered, although increasing efforts are being made 
to use more specific and empirically definable test names. The trait measured by 
a given test can be defined only through an examination of the objective 
sources of information and empirical operations utilized in establishing its va- 
lidity. Moreover, the validity of a test cannot be reported in general terms. No 
test can be said to have “high” or “low” validity in the abstract. Its validity must 
be established with, reference to the particular use for which the test is being 
considered. 

Fundamentally, all procedures for determining test validity are concerned with 
the relationships between performance on the test and other independently ob- 
servable facts about the behavior characteristics under consideration. The specific 
methods employed for investigating these relationships are numerous and have 
been described by various names. These traditional labels focus on different as- 
pects of validity, of special interest for different test uses. As the applications of 
tests developed and expanded, concepts of validity altered accordingly (Anastasi, 
1986a; Messick, 1988, 1989). 
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EVOLVING CONCEPTS OF TEST VALIDITY 


Among the earliest uses of tests was in the assessment of what individuals had 
learned in specified content areas. Today, that use is represented by end-of-course 
examinations in school and by licensing tests for driving a car or qualifying for a 
specified occupation. This type of test, commonly designated as an achievement 
test, is typically evaluated by comparing its content with the content domain it is 
designed to assess. Such a descriptive approach is still relevant in validating tests 
for certain applications, and it will be considered in one section of this chapter. 

As testing moved into its second stage over time, the emphasis shifted to predic- 
tion. How will different persons respond in a given situation, now or at some future 
time? How will this individual react in different specified situations? Performance 
in the situation for which behavior was to be predicted came to be designated as 
the criterion. For this purpose, test validity was usually reported as a correlation co- 
efficient between test scores and a direct and independent measure of that crite- 
rion. This procedure is especially appropriate for the use of tests in the selection or 
placement of individuals for educational programs, for jobs, or for particular treat- 
ment programs. Thus, for a mechanical aptitude test, the criterion might be subse- 
quent job performance as a machinist; for a scholastic aptitude test, it might be 
college grades; and for a neuroticism test, it might be associates’ ratings or other 
available information regarding the individual's behavior in various life situations. 

The current stage in the history of testing reflects two major trends: (1) a 
strengthened theoretical orientation and (2) a close linkage between psychologi- 
cal theory and verification through empirical and experimental hypothesis testing. 
These trends are evident in test construction and validation, as they are in other 
areas of psychology as a whole (Anastasi, 19922, 1992b, 1995). One result of these 
trends is a growing recognition of the value of constructs in describing and under- 
standing human behavior. Constructs are broad categories, derived from the com- 
mon features shared by directly observable behavioral variables. They are 
theoretical entities, not themselves directly observable. Interest in constructs led 
to the introduction of what was first considered to be a third kind of test validity, 
namely construct validity (AERA, APA, NCME, 1985; APA, AERA, NCME, 
1974; Cronbach & Meehl, 1955). Eventually, construct validity came to be recog- 
nized as the fundamental and all-inclusive validity concept, insofar as it specifies 
what the test measures. Content and predictive validation procedures are among 
the many sources of information that contribute to the definition and the under- 
standing of the constructs assessed by a test. At the same time, they provide valu- 
able information in their own right and are preeminent in the evaluation of tests 
for certain uses, Hence the concepts (and terms) have survived, despite their inte- 
gration into the comprehensive concept of construct validity. 


CONTENT-DESCRIPTION PROCEDURES 


Nature. Content-description validation procedures involve essentially the sys- 
tematic examination of the test content to determine whether it covers a repre- 
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sentative sample of the behavior domain to be measured. Such a validation 
procedure is commonly used in tests designed to measure how well the individual 
has mastered a specific skill or course of study. It might seem that mere inspection 
of the content of the test should suffice to establish its validity for such a purpose. 
A test of multiplication, spelling, or bookkeeping would seem to be valid by defi- 
nition if it consists of multiplication, spelling, or bookkeeping items, respectively. 

The solution, however, is not so simple as it appears to be. One difficulty is that 
of adequately sampling the item universe. The behavior domain to be tested tnust 
be systematically analyzed to make certain that all major aspects are covered by 
the test items, and in the correct proportions. For example, a test can easily be- 
come overloaded with those aspects of the field that lend themselves more readily 
to the preparation of objective items. The domain under consideration should be 
fully described in advance, rather than being defined after the test has been pre- 
pared. A well-constructed educational test, for example, should cover the objec- 
tives of instruction, not just its subject matter. Content must therefore be broadly 
defined to include major objectives, such as the application of principles and the 
interpretation of data, as well as factual knowledge. Moreover, validity depends 
on the relevance of the individual's test responses to the behavior area under con- 
sideration, rather than on the apparent relevance of item content. Mere inspec- 
tion of the test may fail to reveal the processes actually used by examinees in 
taking the test. 

It is also important to guard against any tendency to overgeneralize regarding 
the domain sampled by the test. For instance, a multiple-choice spelling test may 
measure the ability to recognize correctly and incorrectly spelled words. But it 
cannot be assumed that such a test also measures ability to spell correctly from 
dictation, frequency of misspellings in written compositions, and other aspects of 
spelling ability (Ahlstróm, 1964; Knoell & Harris, 1952). Still another difficulty 
arises from the possible inclusion of irrelevant factors in the test scores. For exam- 
ple, a test designed to measure proficiency in mathematics or mechanics may be 
unduly influenced by the ability to understand verbal directions or by speed of 
performing simple, routine tasks. 


Specific Procedures: Content validity is built into a test from the outset 
through the choice of appropriate items. For educational tests, the preparation of 
items is preceded by a thorough and systematic examination of relevant course 
syllabi and textbooks, as well as by consultation with subject-matter experts. On 
the basis of the information thus gathered, test specifications are drawn up for the 
item writers. These specifications should show the content areas or topics to be 
covered, the instructional objectives or processes to be tested, and the relative 
importance of individual topics and processes. The final specifications should 
indicate the number of items of each kind to be prepared for each topic. For 
example, assessment of reading ability may include understanding vocabulary in 
context, literal comprehension of content, and drawing correct inferences from 
the given information. It may also sample material from different sources, such as 
essays, poems, newspaper articles, or instructions for operating equipment. A 
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mathematics test may cover computational skills, the solution of verbally pre- 
sented problems, and the application of learned processes to new and unfamiliar 
contexts. 

The discussion of content validation in the manual of an educational achieve- 
ment test should include a description of the procedures followed in ensuring that 
the test content is appropriate and representative. If subject-matter experts par- 
ticipated in the test-construction process, their number and professional qualifi- 
cations should be stated. If they served as judges in classifying items, the 
directions they were given should be reported, as well as the extent of agreement 
among judges. Because curricula and course content change over time, it is partic- 
ularly desirable to give the dates when subject-matter experts were consulted. In- 
formation should likewise be provided about number and nature of course syllabi 
and textbooks surveyed, including publication dates. 

A number of empirical procedures are also commonly followed in order to sup- 
plement the content validation of an educational achievement test. Both total 
scores and performance on individual items can be checked for grade progress. In 
general, those items are retained that show the largest gains in the percentages of 
children passing them from the lower to the upper grades. Other supplementary 
procedures that may be employed, when appropriate, include analyses of types of 
errors commonly made on a test and observation of the work methods employed 
by test takers. The latter could be done by testing students individually with in- 
structions to "think aloud" while solving each problem. The contribution of 
speed can be checked by noting how many persons fail to finish the test or by one 
of the more refined methods discussed in chapter 4. To detect the possible irrele- 
vant influence of ability to read instructions on test performance, scores on the 
test may be correlated with scores on a reading comprehension test. On the other 
hand, if the test is designed to measure reading comprehension, giving the ques- 
tions without the reading passage on which they are based will show how many 
could be answered simply from the test takers’ prior information or other irrele- 
vant cues (Scherich & Hanna, 1977). 


Applications. Especially when bolstered by such empirical checks as those 
already illustrated, content validation provides an adequate technique for 
evaluating achievement tests. It permits us to answer two questions that are basic 
to the validity of educational and occupational achievement tests: (1) Does the 
test cover a representative sample of the specified skills and knowledge? (2) Is test 
performance reasonably free from the influence of irrelevant variables? Content 
validation is particularly appropriate for the domain-referenced tests described in 
chapter 3. Because performance on those tests is interpreted in terms of content 
meaning, it is obvious that content validation is a prime requirement for their 
effective use. Nevertheless, other types of validity evidence are also relevant to a 
full evaluation of the effectiveness of such tests (see Hambleton, 1984b). 
Content validation is also applicable to certain occupational tests designed for 
employee selection and classification, discussed in chapter 17. This type of valida- 
tion evidence is suitable when the test is an actual job sample or otherwise calls 


Validity: Basic Concepts 131 


for the same skills and knowledge required on the job. In such cases, a thorough 
job analysis should be carried out in order to demonstrate the close resemblance 
between the job activities and the test. A clear, step-by-step account of the appli- 
cation of these validation procedures to the development of an industrial reading 
test is illustrated by Schoenfeldt, Schoenfeldt, Acker, and Perlson (1976). Work- 
ing closely with job incumbents and supervisors, the investigators surveyed the 
reading requirements of entry-level jobs in a large manufacturing company in 
terms of both subject matter and comprehension skills. Test items were then con- 
structed to match these requirements. This approach is used widely in developing 
tests for government employees at federal and state levels (Hardt, Eyde, Primoff, 
& Tordy, 1981; Menne, McCarthy, & Menne, 1976; Primoff & Eyde, 1988; Tordy, 
Eyde, Primoff, & Hardt, 1976). 

For aptitude and personality tests, on the other hand, content validation is 
usually inappropriate and may, in fact, be misleading. Although considerations of 
relevance and representativeness of content must obviously enter into the initial 
stages of constructing any test, eventual validation of aptitude or personality tests 
requires empirical verification by the procedures to be described in the following 
sections. These tests bear less intrinsic resemblance to the behavior domain they 
are trying to sample than do achievement tests. Consequently, the content of ap- 
titude and personality tests can do little more than reveal the hypotheses that led 
the test constructor to choose a certain type of content for measuring a specified 
trait. Such hypotheses need to be empirically confirmed to establish the validity 
of the test. 

Unlike achievement tests, aptitude and personality tests are not based on a 
specified course of instruction or uniform set of prior experiences from which test 
content can be drawn. Hence, in the latter tests, individuals are likely to vary 
more in the work methods or psychological processes employed in responding to 
the same test items. The identical test might thus measure different functions in 
different persons. Under these conditions, it would be virtually impossible to de- 
termine the psychological functions measured by the test from an inspection of its 
content. For example, college graduates might solve a problem in verbal or math- 
ematical terms, while a mechanic would arrive at the same solution in terms of 
spatial visualization. Or a test measuring arithmetic reasoning among high school 
freshmen might measure only individual differences in speed of computation 
when given to college students. 


Face Validity. Content validity should not be confused with face validity. The 
latter is not validity in the technical sense; it refers, not to what the test actually 
measures, but to what it appears superficially to measure. Face validity pertains to 
whether the test "looks valid" to the examinees who take it, the administrative 
personnel who decide on its use, and other technically untrained observers. 
Fundamentally, the question of face validity concerns rapport and public 
relations. Although common usage of the term “validity” in this connection may 
make for confusion, face validity itself is a desirable feature of tests. For example, 
when tests originally designed for children and developed within a classroom 
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setting were first extended for adult use, they frequently met with resistance and 
criticism because of their lack of face validity. Certainly if test content appears 
irrelevant, inappropriate, silly, or childish, the result will be poor cooperation, 
regardless of the actual validity of the test. Especially in adult testing, it is not 
sufficient for a test to be objectively valid. It also needs face validity to function 
effectively in practical situations. Face validity also affects the acceptability of a 
test in legislative and judicial decisions, as well as its evaluation by the general 
public. 

In a direct and innovative research program on testing as viewed by the test 
taker (cited in chap. 1), Baruch Nevo and his associates included the investiga- 
tion.of face validity (B. Nevo, 1985, 1992; B. Nevo & Sfez, 1985). First, they 
called attention to the paucity of available research on face validity, despite its 
probable contribution to prevalent attitudes toward tests. They then proposed a 
quantitative assessment of face validity by having test takers and other psychome- 
trically unsophisticated interested persons rate the suitability of a test for its in- 
tended use; the same procedures may be employed in rating individual test items 
or a battery of tests. Illustrative data were reported from an analysis of the re- 
sponses on an Examinee Feedback Questionnaire completed by 1,385 Israeli stu- 
dents who had taken a six-test university entrance examination. The results 
showed promising interrater agreement, retest reliability, and differentiation 
among tests and among subgroups of respondents planning to major in different 
fields. It was recommended that both qualitative and quantitative data about face 
validity be regularly reported in test manuals. 

Face validity can often be improved by merely reformulating test items in 
terms that appear relevant and plausible in the particular setting in which they 
will be used. For example, if a test of simple arithmetic reasoning is constructed 
for use with machinists, the items should be worded in terms of machine opera- 
tions rather than in terms of "how many oranges can be purchased for 86 cents" or 
other traditional schoolbook problems. Similarly, an arithmetic test for naval per- 
sonnel can be expressed in naval terminology, without necessarily altering the 
functions measured. To be sure, face validity should never be regarded as a substi- 
tute for objectively determined validity. It cannot be assumed that improving the 
face validity of a test will improve its objective validity. Nor can it be assumed 
that when a test is modified so as to increase its face validity, its objective validity 
remains unaltered. The validity of the test in its final form should always be di- 
rectly checked. 


CRITERION-PREDICTION PROCEDURES 


Concurrent. and Predictive Validation. Criterion-prediction validation 
procedures indicate the effectiveness of a test in predicting an. individual's 
performance in specified activities. The criterion measure against which test 
scores are validated may be obtained at approximately the same time as the test 
scores or after a stated interval. The 1985 Testing Standards differentiated between 
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concurrent and predictive validation on the basis of these time relations between 
criterion and test. The term "prediction" can be used in the broader sense, to refer 
to prediction from the test to any criterion situation, or in the more limited sense of 
prediction over a time interval. It is in the latter sense that it is used in the 
expression "predictive validity." The information provided by predictive validation 
is most relevant to tests used in the selection and classification of personnel. Hiring 
job applicants, selecting students for admission to college or professional schools, 
and assigning military personnel to occupational training programs represent 
examples of the sort of decisions requiring a knowledge of the predictive validity of 
tests. Other examples include the use of tests to screen out applicants likely to 
develop emotional disorders in stressful environments and the use of tests to 
identify psychiatric patients most likely to benefit from a particular therapy. 

In a number of instances, concurrent validation is employed merely as a substi- 
tute for predictive validation. It is frequently impracticable to extend validation 
procedures over the time required for predictive validation or to obtain a suitable 
preselection sample for testing purposes. As a compromise solution, therefore, 
tests are administered to a group on whom criterion data are already available. 
Thus, the test scores of college students may be compared with their cumulative 
grade-point average at the time of testing, or those of employees compared with 
their current job success: 

For certain uses of psychological tests, on the other hand, concurrent predic- 
tion is the most appropriate type and can be justified in its own right. The logical 
distinction between predictive and concurrent validation is based not on time but 
on the objectives of testing. Concurrent validation is relevant to tests employed 
for diagnosis of existing status, rather than prediction of future outcomes. The dif- 
ference can be illustrated by asking, “Does Smith qualify as a satisfactory pilot?” or 
“Does Smith have the prerequisites to become a satisfactory pilot?” The first ques- 
tion calls for concurrent validation; the second, for predictive validation. 

Because the criterion for concurrent validation is always available at the time 
of testing, we might ask what function is served by the test in such situations. Ba- 
sically, such tests provide a simpler, quicker, or less expensive substitute for the 
criterion data. For example, if the criterion consists of continuous observation of 
a patient during a two-week hospitalization period, a test that could sort out nor- 
mals from disturbed and doubtful cases would appreciably reduce the number of 
persons requiring such extensive observation. 


Criterion Contamination. An essential precaution in finding the validity of 
a test is to make certain that the test scores do not themselves influence any 
individual's criterion status. For example, if a college instructor or a supervisor in 
an industrial plant knows that a particular individual scored very poorly on an 
aptitude test, such knowledge might influence the grade given to the student or 
the rating assigned to the worker. Or a high-scoring person might be given the 
benefit of the doubt when academic grades or on-the-job ratings are being 
prepared. Such influences would obviously raise the correlation between test 
scores and criterion in a manner that is entirely spurious or artificial. 
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This possible source of error in test validation is known as criterion contamina- 
tion, since the criterion ratings become "contaminated" by the rater's knowledge 
of the test scores. To prevent the operation of such an error, it is absolutely essen- 
tial that no person who participates in the assignment of criterion ratings have 
any knowledge of the examinees’ test scores. For this reason, test scores employed 
in “testing the test” must be kept strictly confidential. It is sometimes difficult to 
convince teachers, employers, military officers, and other line personnel that such 
a precaution is essential. In their urgency to utilize all available information for 
practical decisions, such persons may fail to realize that the test scores must be put 
aside until the criterion data mature and validity can be checked. 


Criterion Measures. A test may be validated against as many criteria as there 
are specific uses for it. Any method for assessing behavior in any situation could 
provide a criterion measure for some particular purpose. The criteria employed in 
finding the validities reported in test manuals, however, fall into a few common 
categories, Among the criteria most frequently employed in validating intelli- 
gence tests is some index of academic achievement. It is for this reason that such 
tests have often been more precisely described as measures of scholastic aptitude. 
The specific indices used as criterion measures include school grades, achieve- 
ment test scores, promotion and graduation records, special honors and awards, 
and teachers’ or instructors’ ratings for “intelligence.” Insofar as such ratings 
given within an academic setting are likely to be heavily colored by the individu- 
al’s scholastic performance, they may be properly classified with the criterion of 
academic achievement, 

The various indices of academic achievement have provided criterion data at 
all educational levels, from the primary grades to college and graduate school. Al- 
though employed principally in the validation of general intelligence tests, they 
have also served as criteria for certain multiple-aptitude and personality tests. In 
the validation of any of these types of tests for use in the selection of college stu- 
dents, for example, a common criterion is freshman grade-point average. This 
measure is the average grade in all courses taken during the freshman year, each 
grade being weighted by the number of course points for which it was received. 


underlying this criterion is that the educational ladder serves as a progressively se- 
lective influence, eliminating those incapable ot continuing beyond each step. 
Although it is undoubtedly true that college gracluates, for example, represent a 


at the higher educational levels, economic, social, motivational, and other non- 
intellectual factors may influence the continuation of the individual's education. 
Moreover, with such concurrent validation, it is difficult to disentangle cause- 


» 
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test scores simply the result of the varying amount of education? And to what ex- 
tent could the test have predicted individual differences in subsequent educa- 
tional progress? These questions can be answered only when the test is 
administered before the criterion data have matured, as in predictive validation. 

In the development of special aptitude tests, a frequent type of criterion is 
based on performance in specialized training. For example, mechanical aptitude tests 
may be validated against final achievement in shop courses. Various business 
school courses, such as typing or bookkeeping, provide criteria for aptitude tests 
in these areas. Similarly, performance in music or art schools has been employed 
in validating music or art aptitude tests. Several professional aptitude tests have 
been validated in terms of achievement in schools of law, medicine, dentistry 
engineering, and other areas. In the case of custom-made tests, designed for use 
within a specific testing program, training records are a frequent source of 
criterion data. An outstanding illustration is the validation of Air Force pilot- 
selection tests against performance in basic flight training. Performance in train- 
ing programs is also commonly used as a criterion for test validation in other 
military occupational specialties and in some industrial validation studies. 

Among the specific indices of training performance employed for criterion pur- 
poses may be mentioned achievement tests administered on completion of train- 
ing, formally assigned grades, instructors' ratings, and successful completion of 
training versus elimination from the program. Multiple aptitude batteries have 
often been checked against grades in specific high school or college courses, in 
order to determine their validity as differential predictors. For example, scores on 
a verbal comprehension test may be compared with grades in English courses, spa- 
tial visualization scores with geometry grades, and so forth. 

In connection with the use of training records in genera] as criterion measures, 
a useful distinction is that between intermediate and ultimate criteria. In the 
development of an Air Force pilot-selection test or a meclical aptitude test, for 
example, the ultimate criteria would be combat performance and eventual 
achievement as a practicing physician, respectively. Obviously, it would require a 
long time for such criterion data to mature. It is doubtful, moreover, whether a 
truly ultimate criterion is ever obtained in actual practice. Finally, even were such 
an ultimate criterion available, it would probably be subject to many uncon- 
trolled factors that would render it relatively useless. For example, it would be dif- 
ficult to evaluate the relative degree of success of physicians practicing different 
specialties and in different parts of the country. For these reasons, such intermedi- 
ate criteria as performance records at some stage of training are frequently em- 
ployed as criterion measures. 

For many purposes, the most satisfactory type of criterion measure is that based 
on follow-up records of actual job performance. This criterion has been used to 
some extent in the validation of general intelligence as well as personality tests, 
and toa larger extent in the validation of special aptitude tests. It is a common cri- 
terion in the validation of custom-made tests for specific jobs. The “jobs” in ques- 
tion may vary widely in both level and kind, including work in business, industry, 
the professions, and the armed services. Most measures of job performance, 
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although probably not representing ultimate criteria, at least provide good inter- 
mediate criteria for many testing purposes. In this respect, they are to be preferred 
to training records. On the other hand, the measurement of job performance does 
not permit as much uniformity of conditions as is possible during training. More- 
over, since it usually involves a longer follow-up, the criterion of job performance 
is likely to entail a loss in the number of available participants. Because of the vari- 
ation in the nature of nominally similar jobs in different organizations, test manu- 
als reporting validity data against job criteria should describe not only the specific 
criterion measures employed but also the job duties performed by the workers. 

Validation by the method of contrasted groups generally involves a composite 
criterion that reflects the cumulative and uncontrolled selective influences of 
everyday life. This criterion is ultimately based on survival within a particular 
group versus elimination therefrom. For example, the validity of a musical aptitude 
or a mechanical aptitude test may be checked by comparing the scores obtained by 
students enrolled in a music school or an engineering school, respectively, with the 
scores of unselected high school or college students. Of course, contrasted groups 
could be selected on the basis of any criterion, such as school grades, ratings, or job 
performance, by simply choosing the extremes of the distribution of criterion mea- 
sures. The contrasted groups included in the present category, however, are dis- 
tinct groups that have gradually become differentiated through the operation of 
the multiple demands of daily living. The criterion under consideration is thus 
more complex and less clearly definable than those previously discussed. 

The method of contrasted groups is used quite commonly in the validation of 
personality tests. Thus, in validating a test of social traits, the test performance of 
salespersons or executives, on the one hand, may be compared with that of clerks 
or engineers, on the other. The assumption underlying such a procedure is that, 
with reference to many social traits, individuals who have entered and remained in 
such occupations as selling or executive work will as a group excel persons in such 
fields, as clerical work or engineering. Similarly, college students who have en- 
gaged in many extracurricular activities may be compared with those who have 
participated in none during a comparable period of college attendance. Occupa- 
tional groups have frequently been used in the development and validation of in- 
terest tests, such as the Strong Vocational Interest Blank (SVIB), as well as in the 
preparation of attitude scales. Other groups sometimes employed in the validation 
of attitude scales include political, religious, geographical, or other special groups 
generally known to represent distinctly different points of view on certain issues. 

In the empirical validation of domain-referenced tests, several adaptations of 
the method of contrasted groups have been employed, in addition to the usual 
content-validation procedures (Hambleton, 1984b). For this purpose, groups dif- 
fering in amount of relevant instruction are compared in test performance. If mas- 
tery scoring is employed, a 2 x 2 analysis can be made, in which the proportion of 
pass and fail scores in the preinstruction group is compared with the proportion of 
pass and fail scores in the postinstruction group (Panell & Laabs, 1979). Similar 

comparisons can be made if the test is administered to schoolchildren in one 
grade below and one grade above the grade where the particular concept or skill 
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assessed by the test is taught. If scores are available after several different periods 
of instruction, à correlation can be found between: actual performance and 
amounts of instruction. o 

In the development of certain personality tests, psychiatric diagnosis is used both 
as a basis for the selection of items and as evidence of test validity. Psychiatric di- 
agnosis may serve as a satisfactory criterion provided that it is based on prolonged 
observation and detailed case history, rather than on a cursory psychiatric inter- 
view or examination. In the latter case, there is no reason to expect the psychi- 
atric diagnosis to be superior to the test score itself as an indication of the 
individual's emotional condition. Such a psychiatric diagnosis could not be re- 
garded as a criterion measure, but rather as an indicator or predictor whose own 
validity would have to be determined. 

Mention has already been made, in connection with other criterion categories, 
of certain types of ratings by schoolteachers, instructors in specialized courses, and 
job supervisors. To these can be added ratings by officers in military situations, 
ratings of students by school counselors, and ratings by co-workers, classmates, 
fellow club members, and other groups of associates. The ratings discussed earlier 
represented merely a subsidiary technique for obtaining information regarding 
such criteria as academic achievement, performance in specialized training, or job 
success. We are now considering the use of ratings as the very core of the criterion 
measure. Under these circumstances, the ratings themselves define the criterion. 
Moreover, such ratings are not restricted to the evaluation of specific achieve- 
ment but involve a personal judgment by an observer regarding any of the variety 
of traits that psychological tests attempt to measure. Thus, the participants in the 
validation sample might be rated on such characteristics as dominance, mechani- 
cal ingenuity, originality, leadership, or honesty. 

Ratings have been employed in the validation of almost every type of test. 
They are particularly useful in providing criteria for personality tests, since objec- 
tive criteria are much more difficult to find in this area. This is especially true of 
distinctly social traits, in which ratings based on personal contact may constitute 
the most logically defensible criterion. Although ratings may be subject to many 
judgmental errors, when obtained under carefully controlled conditions they rep- 
resent a valuable source of criterion data. Techniques for improving the accuracy 
of ratings and for reducing common types of errors are considered in chapter 16. 

Finally, correlations between a new test and previously available tests are fre- 
quently cited as evidence of validity. When the new test is an abbreviated or sim- 
plified form of a currently available test, the latter can properly be regarded as a 
criterion measure. Thus, a paper-and-pencil test might be validated against a 
more elaborate and time-consuming performance test whose validity had previ- - 
ously been established. Or a group test might be validated against an individual 
test. The Stanford-Binet, for example, has repeatedly served as a criterion in vali- 
dating group tests. In such a case, the new test may be regarded at best as a crude 
approximation of the earlier one. It should be noted that unless the new test rep- 
resents a simpler or shorter substitute for the earlier test, the use of the latter as a 
criterion is inappropriate. 
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A major development in test construction in the 1980s and 1990s focused at- 
tention on criterion analysis. This is an aspect of test development that had com- 
monly been neglected in traditional test research. Over the years, scattered voices 
were raised to urge the need for systematic criterion research, but practical imple- 
mentation was meager (L. R. Jaines, 1973; Tenopyr, 1986). Even in well-designed 
test construction projects that used careful job analyses as a guide for test develop- 
ment, the results of the job analyses had little effect on the criterion measure used 
in the subsequent validation of the tests. The criterion was usually accepted as 
“just being there” and was all too often represented by a single global index of job 
performance based on ratings or output records. 

It is now being widely recognized that the validity of a test can be most effec- 
tively investigated by identifying the major constructs in the performance of the 
given job and then choosing or developing tests whose scores assess the needed 
constructs (J. P. Campbell, 1990a; . J. P. Campbell, McHenry, & Wise, 1990; 
L. V. Jones & Appelbaum, 1989; Messick, 1995). An outstanding example of the 
application of thorough criterion research as a first step in the development of a 
test battery is provided by the U.S. Army’s Selection and Classification Project, 
commonly known as “Project A” (J. P. Campbell, 1990b). Because of its general 

relevance to the industrial and organizational use of tests, this large-scale, seven- 
year project is cited more fully in chapter 17. 


Validity Generalization. Criterion-prediction validity is often used in local 
validation studies, in which the effectiveness of a test for a specific program is to 
be assessed. This is the approach followed, for example, when a given company 
wishes to evaluate a test for selecting applicants for one of its jobs or when a given 
college wishes to determine how well an academic aptitude test can predict the 
course performance of its students. Criterion-prediction validity can be best 
characterized as the practical validity of a test for a specified purpose. 

When standardized aptitude tests were first correlated with performance on pre- 
sumably similar jobs in industrial validation studies, the validity coefficients were 
found to vary widely (Ghiselli, 1959, 1966). Similar variability among validity 
coefficients was observed when the criteria were grades in various school courses 
(G. K. Bennett, Seashore, & Wesman, 1984). Such findings led to widespread pes- 
simism regarding the generalizability of test validity to different situations. Until 
the mid-1970s, “situational specificity" of psychological requirements was generally 
regarded as a serious limitation in the usefulness of standardized tests in personnel 
selection. In a sophisticated statistical analysis of the problem, however, Schmidt, 
Hunter, and their associates demonstrated that much of the variance among ob- 
tained validity coefficients may be a statistical artifact resulting from small sample 
size, criterion unreliability, and restriction of range in employee samples. ! 


!This work was part of a continuing research program reported in many articles and monographs, The publica- 
tions most relevant to the present topic include (but are not limited to) Pearlman, Schmidt, and Hunter (1980), 
Schmidt, Gast-Rosenberg, and Hunter (1980), Schmidt and Hunter (1977), Schmidt; Hunter, and Pearlman 
(1981), and Schmidt, Hunter, Pearlman, and Shane ( 1979). 
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The industrial samples available for test validation are generally too small to 
yield a stable estimate of the correlation between predictor and criterion. For the 
same reason, the obtained coefficients may be too low to reach statistical signifi- 
cance in the sample employed and may thus fail to provide evidence of the test's 
validity. It has been estimated that about half of the validation samples used in in- 
dustrial studies include no more than 40 or 50 cases (Schmidt, Hunter, & Urry, 
1976). With such small samples, the application of criterion-prediction valida- 
tion is not technically feasible. 

Applying their newly developed techniques to data from many samples drawn 
from a large number of occupational specialties, Schmidt, Hunter, and their co- 
workers were able to show that the validity of tests of verbal, numerical, and rea- 
soning aptitudes can be generalized far more widely across occupations than had 
previously been recognized. The variance of validity coefficients typically found 
in earlier industrial studies proved to be no greater than would be expected by 
chance. This was true even when the particular job functions appeared to be 
quite dissimilar across jobs. Evidently, the successful performance of a wide vari- 
ety of occupational tasks depends to a significant degree on a common core of 
cognitive skills. The tests included in these studies covered chiefly the type of 
content and skills sampled in traditional intelligence and scholastic aptitude 
tests. It would seem that this cluster of cognitive skills and knowledge is broadly 
predictive of performance in both academic and occupational activities de- 
manded in advanced technological societies. Nevertheless, more precise selection 
decisions can generally be reached by considering scores on two or three broad 
cognitive clusters, preferably supplemented by measures of technical task-specific 
skills for particular jobs (Hartigan & Wigdor, 1989; L. L. Wise, McHenry, & 
Campbell, 1990; Zeidner & Johnson, 1991). 


Meta-Analysis. The statistical procedures employed in investigating validity 
generalization provide essentially a way of integrating findings from different 
studies. They may combine data from past and present research, from studies 
conducted in different places, or from information available in published studies. 
Although first designated as meta-analysis and introduced in psychological 
research in the 1970s (Glass, 1976; Schmidt & Hunter, 1977), the basic 
procedures had been in use for several decades, especially in other sciences 
(Hartigan & Wigdor, 1989, chap. 6). Meta-analysis received increasing attention 
in psychology as a substitute for the traditional literature survey (Lipsey & 
Wilson, 1993; Schmidt, 1992). Such surveys typically reported the number of 
studies that found statistically significant effects in, for example, differences 
between means of experimental and control groups or correlations between test 
scores and other variables. With this traditional procedure, promising positive 
results were often lost because the samples used in single studies were too small to 
yield significant differences. ; 
By combining the published findings of several studies and weighting them in- 
sofar as possible on the basis of relevant methodological and substantive features 
of each study, meta-analysis may reveal substantial positive findings. A further 
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advantage is that meta-analysis permits the computation of effect sizes. For both 
theoretical and practical purposes, the estimated magnitude of a difference or a 
correlation is more useful than is the simple demonstration that it is significantly 
greater than zero. 

The decades of the 1980s and 1990s witnessed a rapid upsurge of meta-analytic 
research in almost every field of psychology. Its applications in occupational selec- 
tion and classification research probably attracted the widest attention (see chap. 
17). Interest in meta-analysis is steadily increasing, and procedural refinements 
are constantly being developed. Although some controversy remains about spe- 
cific techniques, the major results do not vary appreciably across procedures.’ 


CONSTRUCT-IDENTIFICATION PROCEDURES 


The term “construct validity” was officially introduced into the psychometrist’s 
lexicon in 1954 in the Technical Recommendations for Psychological Tests and Diag- 
nostic Techniques (APA, 1954), which constituted the first edition of the current 
Testing Standards. The first detailed exposition of construct validity appeared the 
following year in an article by Cronbach and Meehl (1955). The discussions of 
construct validation that followed—and that are continuing with undiminished 
vigor—have served to make the implications of its procedures more explicit and 
to provide a systematic rationale for their use. Construct validation has focused 
attention on the role of psychological theory in test construction and on the need 
to formulate hypotheses that can be proved or disproved in the validation 
process. Construct validation has also stimulated the search for novel ways of 
gathering validity data. Although several of the techniques employed in investi- 
gating construct validity had long been familiar, the field of operation has been 
expanded to admit a wider variety of procedures. 

The construct validity of a test is the extent to which the test may be said to 
measure a theoretical construct or trait. Examples of such constructs are scholas- 
tic aptitude, mechanical comprehension, verbal fluency, speed of walking, neu- 
roticism, and anxiety. Each construct is developed to explain and organize 
observed response consistencies, It derives from established interrelationships 
among behavioral measures. Construct validation requires the gradual accumula- 
tion of information from a variety of sources. Any data throwing light on the na- 
ture of the trait under consideration and the conditions affecting its development 
and manifestations represent appropriate evidence for this validation. Illustra- 
tions of specific techniques that contribute to construct identification are consid- 
ered in the following sections. 


"Recent applications, detailed explanations of procedures, and critical evaluations can be f. in Hartigan and 
Wigdor (1989), Hedges (1988), Hunter and Schmidt (1990), L. R. James, Demaree, Keck Sep (1992), 
L. V. Jones and Applebaum (1989), R. Rosenthal (1991), Schmidt (1992), Schmidt et al. (1993), and Schmidt, 
Ones, and Hunter (1992). For a simple introduction to statistical procedures, see E M. Wolf (1986). For a 
broader view of the use of meta-analysis in behavioral research, see Cook et al. (1992), Cooper and Hedges 
(1994), Hasselblad and Hedges (1995), and Wachter and Straf (1990). e 
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Developmental Changes. A major criterion employed in the validation of a 
number of traditional intelligence tests is age differentiation. Such tests as the 
Stanford-Binet and most preschool tests are checked against chronological age to 
determine whether the scores show a progressive increase with advancing age. 
Since abilities are expected to increase with age during childhood, it is argued 
that the test scores should likewise show such an increase, if the test is valid. The 
very concept óf an age scale of intelligence, as initiated by Binet, is based on the 
assumption that "intelligence" increases with age, at least until maturity. 

The criterion of age differentiation, of course, is inapplicable to any functions 
that do not exhibit clear-cut and consistent age changes. In the area of personal- 
ity measurement, for example, it has found limited use. Moreover, it should be 
noted that, even when applicable, age differentiation is a necessary but not a suf- 
ficient condition for validity. Thus, if the test scores fail to improve with age, such 
a finding probably indicates that the test is not a valid measure of the abilities it 
was designed to sample. On the other hand, to prove that a test measures some- 
thing that increases with age does not define the area covered by the test very pre- 
cisely. A measure of height or weight would also show regular age increments, 
although it would obviously not be designated an intelligence test. 

A final point should be emphasized regarding the interpretation of the age cri- 
terion. A psychological test validated against such a criterión measures behavior 
characteristics that increase with age under the conditions existing in the type of 
environment in which the test was standardized. Because different cultures may 
stimulate and foster the development of dissimilar behavior characteristics, it 
cannot be assumed that the criterion of age differentiation is a universal one. Like 
all other criteria, it is circumscribed by the particular cultural context in which it 
is derived. 

Developmental analyses are also basic to the construct validation of the Piaget- 
ian ordinal scales cited in chapters 3 and 9. A fundamental assumption of such 
scales is the sequential patterning of development, such that the attainment of ear- 
lier stages in concept development is prerequisite to the acquisition of later con- 
ceptual skills. There is thus an intrinsic hierarchy in the content of these scales. 
The construct validation of ordinal scales should therefore include empirical data 
on the sequential invariance of the successive steps. This involves checking the 
performance of children at different levels in the development of any tested con- 
cept, such as conservation or object permanence. Do children who demonstrate 
mastery of the concept at a given level also exhibit mastery at the lower levels? 


Correlations with Other Tests. Correlations between a new test and similar 
earlier tests are sometimes cited as evidence that the new test measures approxi- 
mately the same general area of behavior as other tests designated by the same 
name, such as “intelligence tests” or “mechanical aptitude tests.” Unlike the 
correlations found in criterion-prediction validity, these correlations should be 
moderately high, but not too high. If the new test correlates too highly with an 
already available test, without such added advantages as brevity or ease of 
administration, then the new test represents needless duplication. 
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Correlations with other tests are employed in still another way to demonstrate 
that the new test is relatively free from the influence of certain irrelevant factors. 
For example, a special aptitude test or a personality test should not have a high 
correlation with tests of general intelligence or scholastic aptitude. Similarly, 
reading comprehension should not appreciably affect performance on such tests. 
Accordingly, correlations with tests of general intelligence, reading, or verbal 
comprehension are sometimes reported as indirect or negative evidence of valid- 
ity. In these cases, high correlations would make the test suspect. Low correla- 
tions, however, would not in themselves ensure validity. It will be noted that this 
use of correlations with other tests is similar to one of the supplementary tech- 
niques described under content-description procedures. 


Factor Analysis. Developed as a means of identifying psychological traits, 
factor analysis is particularly relevant to construct-validation procedures, Es- 
sentially, factor analysis is a refined statistical technique for analyzing the 
interrelationships of behavior data. For example, if 20 tests have been given to 
300 persons, the first step is to compute the correlations of each test with every 
other. An inspection of the resulting table of 190 correlations may itself reveal 
certain clusters among the tests, suggesting the location of common traits. Thus, 
if such tests as vocabulary, analogies, opposites, and sentence completion have 
high correlations with one another and low correlations with all other tests, we 
could tentatively infer the presence of a verbal comprehension factor. Because 
such an inspectional analysis of a correlation table is difficult and uncertain, more 
Precise statistical techniques have been developed to locate the common factors 
required to account for the obtained correlations. These techniques of factor 
analysis are examined further in chapter 11, in connection with their use in 
research on the nature of intelligence, where they originated. 

In the process of factor analysis, the number of variables or categories in terms 
of which each individual’s performance can be described is reduced from the 
number of original tests to a relatively small number of factors, or common traits. 
In the example cited above, five or six factors might suffice to account for the in- 
tercorrelations among the 20 tests. Each individual might thus be described in 
terms of her or his scores in the five or six factors, rather than in terms of the orig- 
inal 20 scores. A major purpose of factor analysis is to simplify the description of 
behavior by reducing the number of categories from an initial multiplicity of test 
variables to a few common factors, or traits. 

After the factors have been identified, they can be utilized in describing the 
factorial composition of a test. Each test can thus be characterized in terms of the 
major factors determining its scores, together with the weight or loading of each 
factor and the correlation of the test with each factor. Such a correlation is some- 
times reported as the factorial validity of the test. Thus, if the verbal comprehen- 
sion factor correlates .66 with a vocabulary test, the factorial validity of this 
vocabulary test as a measure of the trait of verbal comprehension is .66. It should 

be noted that factorial validity is essentially the correlation of the test with what- 
ever 1$ common to a group of tests or other indices of behavior. The set of vari- 
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ables analyzed can, of course, include both test and nontest data. Ratings and 
other criterion measures can thus be utilized, along with other tests, to explore 
the factorial composition of a particular test and to define the common traits it 
measures. 


Internal Consistency. Inthe published description of certain tests, especially 
in the personality domain, the statement is made that the test has been validated 
by the method of internal consistency. The essential characteristic of this meth- 
od is that the criterion is none other than the total score on the test itself. 
Sometimes an adaptation of the contrasted group method is used, extreme groups 
being selected on the basis of the total test score. The performance of the upper 
criterion group on each test item is then compared with that of the lower 
criterion group. Items that fail to show a significantly greater proportion of 
“passes” (or keyed responses) in the upper than in the lower criterion group are 
considered invalid, and are either eliminated or revised. Correlational procedures 
may also be employed for this purpose. For example, the biserial correlation 
between “pass-fail” on each item and total test score can be computed. Only those 
items yielding significant item-test correlations would be retained. A test whose 
items were selected by this method can be said to show internal consistency, since 
each item differentiates among respondents in the same direction as does the 
entire test. 

Another application of the criterion of internal consistency involves the corre- 
lation of subtest scores with total score. Many intelligence tests, for instance, con- 
sist of separately administered subtests (such as vocabulary, arithmetic, picture 
completion, etc.) whose scores are combined in finding the total test score. In the 
construction of these tests, the scores on each subtest ate often correlated with 
total score, and any subtest whose correlation with total score is too low is elimi- 
nated. The correlations of the remaining subtests with total score are then re- 
ported as evidence of the internal consistency of the entire iristrument. 

It is apparent that internal consistency correlations, whether based on items or 
subtests, are essentially measures of homogeneity. Because it helps to characterize 
the behavior domain or trait sampled by the test, the degree of homogeneity of a 
test has some relevance to its construct validity. Nevertheless, the contribution of 
internal consistency data to test validation is limited. In the absence of data ex- 
ternal to the test itself, little can be learned about what a test measures. 


Convergent and Discriminant Validation. In a thoughtful analysis of 
construct validation, D. T. Campbell (1960) pointed out that, in order to dem- 
onstrate construct validity, we must show not only that a test correlates highly 
with other variables with which it should theoretically correlate but also that it 
does not correlate significantly with variables from which it should differ. In an 
earlier article, D. T. Campbell and Fiske (1959) described the former process as 
convergent validation and the latter as discriminant validation. Correlation of a 
quantitative reasoning test with subsequent grades in a math course would be an 
example of convergent validation. For the same test, discriminant validity would 
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be evidenced by a low and insignificant correlation with scores ona reading com- 
prehension test, since reading ability is an irrelevant variable in a test designed to 
measure quantitative reasoning. C qe p 

It will be recalled that the requirement of low correlation with irrelevant vari- 
ables was discussed in connection with supplementary and precautionary proce- 
dures followed in content validation. Discriminant validation is also especially 
relevant to the validation of personality tests, in which irrelevant variables may 
affect scores in a variety of ways- ; i í 

Campbell and Fiske (1959) proposed systematic experimental design for the 
dual approach of convergent and discriminant validation, which they called the 
multitrait-multimethod matrix. Essentially, this procedure requires the assessment of 
two or more traits by two or more methods. A hypothetical example provided by 
Campbell and Fiske will serve to illustrate the procedure. Table 5-1 shows all pos- 
sible correlations among the scores obtained when three traits are each measured 
by three methods. The three traits could represent three personality characteris- 
tics, such as (A) dominance, (B) sociability, and (C) achievement motivation. 
The three methods could be (1) a self-report inventory, (2) a projective tech- 
nique, and (3) peer ratings. Thus, Ay would indicate dominance scores on the 
self-report inventory, Az dominance scores on the projective test; Cs would rep- 
resent peer ratings on achievement motivation. 

The hypothetical correlations given in Table 5-1 include reliability coeffi- 
cients (in parentheses, along principal diagonal) and validity coefficients (in 
boldface, along three shorter diagonals). In these validity coefficients, the scores 
obtained for the same trait by different methods are correlated; each measure is 
thus being checked against other, independent measures of the same trait, as in 
the familiar validation procedure. The table also includes correlations between 
different traits measured by the same method (in solid triangles) and correlations 
between different traits measured by different methods (in broken triangles). For 
satisfactory construct validity, the validity coefficients should obviously be higher 
than the correlations between different traits measured by different methods; they 
should also be higher than the correlations between different traits measured by 
the same method. For example, the correlation between dominance scores from a 
self-report inventory and dominance scores from a projective test should be 
higher than the correlation between dominance and sociability scores from a self- 
report inventory. If the latter correlation, representing common method variance, 
were high, it might indicate, for example, that a person's scores on this inventory 
are unduly affected by some irrelevant common factor such as ability to under- 
stand the questions or desire to make oneself appear in a favorable light on all 
traits. 


Experimental Interventions. A further source of data for construct 
validation is provided by experiments on the effect of selected variables on test 
scores. In checking the validity of a test for use in an individualized instructional 
program, for example, one approach is through a comparison of pretest and 
posttest scores. The rationale of such a test calls for low scores on the pretest, 
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administered before the relevant instruction, and high scores on the posttest. 
This relationship can also be checked for individual items in the test. Ideally, the 
largest proportion of examinees should fail an item on the pretest and pass it on 
the posttest. Items that are commonly failed on both tests are too difficult, and 
those passed on both: tests too easy, for the purposes of such a test. If a sizable 
Proportion of examinees pass an item on the pretest and fail it on the posttest, 
there is obviously something wrong with the item, or the instruction, ot both. 

A test designed to measure anxiety-proneness can be administered to persons 
who are subsequently put through a situation designed to arouse anxiety, such as 
taking an examination under distracting and stressful conditions. The initial anxi- 
ety test scores can then be correlated with physiological and other indices of anxi- 
ety expression during and after the examination. A different hypothesis regarding 
an anxiety test could be evaluated by administering the test before and after an 
anxiety-arousing experience and seeing whether test scores rise significantly on the 
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retest. Positive findings from such an experiment would indicate that the test 
scores reflect current anxiety level. In a similar way, experiments can be designed 
to test other hypotheses regarding the trait measured by a given test. 


Structural Equation Modeling. In addition’ to the identification of 
constructs in both test scores and criterion performance, an important advance in 
test validation was the consideration of relations among constructs and of the 
path whereby a construct affects criterion performance (J. P. Campbell, 1990a; 
Messick, 1989; Schmidt, Hunter, & Outerbridge, 1986). For example, a person’s 
interest in a particular area may influence job performance through increasing her 
or his learning of relevant factual knowledge, through acquisition of required 
procedural skills, or through the development of the motivation needed to exert 
and sustain maximal effort in carrying out the necessary activities. Investigating 
how an identified construct or personal trait leads to good or poor performance 
contributes substantially to understanding why a test has high or low validity in a 
given situation. Such an analysis is facilitated by a statistical procedure known as 
structural equation modeling, whose use has increased impressively in the 1980s 
and 1990s. See, for example, the establishment of the journal, Structural Equation 
Modeling (1994). This procedure is closely related to several versions of path 
analysis, and both are often informally designated as “causal modeling.”* 
What, specifically, is structural equation modeling designed to accomplish, and 
how did it originate? In elementary statistics, one soon learns that correlation 
does not indicate causation. A familiar example is spurious age correlation. 
Among schoolchildren ranging in age form 6 to 14 years, the correlation between 
height and proficiency in arithmetic is likely to be high; but we cannot conclude 
that either variable affects the other. The correlation results largely from age, 
which is in turn associated with amount of schooling. In an effort to investigate 
causal relations, researchers in the 1960s and 1970s began to use cross-lagged ex- 
perimental designs (D. T. Campbell & Stanley, 1966; Cook & Campbell, 1976, 
pp. 284—293). For example, to analyze the reciprocal causal influence between a 
student’s attitude toward math and her or his math performance, measures of atti- 
tude toward math and performance in math could be obtained at two points in 
time: The cross-lagged correlation could then be found between math attitude at 
Time 1 and math performance at Time 2, and between math performance at Time 
1 and math attitude at Time 2. The size of the two correlations would indicate the 
relative strength of the influence in each direction. For a few years, this seemed a 
promising way to assess the effect of two variables on each other. 
Before long, however, both logical and statistical analyses revealed serious 
weaknesses in the use of cross-lagged correlations. Although the basic cross- 
lagged design was excellent, the use of simple, zero-order correlations was likely to 


?[n order to avoid philosophical implications about ultimate or complete causation of any event, psychologists 
have preferred more neutral expressions, such as the statements that A determines, influences, or affects B. Nev- 
ertheless, the term "causal" is sometimes used to refer to all these relations, with the assumption that its limita- 
tions are understood (see, e.g., L. R. James, Mulaik, & Brett, 1982, chap. 1; P. A. Whire, 1990). 
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distort results and lead to incorrect conclusions about causal: relations (Rogosa, 
1980). Among the sources of error in this procedure are the failure to take into 
account, first, correlations between the initial variables and between the subse- 
quent variables; second, the reliability of the variables and their stability over 
time; and third, the possible contribution of unmeasured variables, such as age 
and schooling in the classical example cited above. Structural equation modeling 
provides ways of avoiding such difficulties. Essentially, it does so by using regres- 
sion equations to predict the dependent from the independent variables in cross- 
lagged or other causal models. In this procedure, partial correlations are used in 
finding the regression coefficients, thereby incorporating all correlations among 
the variables; both measurement and sampling errors are taken into account; and 
some provision is made at least to recognize the possibility of additional, unmea- 
sured causal variables (Bentler, 1988; L. R: James et al., 1982; Loehlin, 1992; Ro- 
gosa, 1979). 

The first step in structural equation modeling is to design a model of the hy- 
pothesized causal relations to be tested. It is important that this model be based 
on thorough familiarity with existing knowledge about the variables and the situ- 
ation under investigation. The hypothesized relations should have a sound theo- 
retical rationale. The actual testing of the model is done by solving a set of 
simultaneous linear regression equations.* In causal modeling, there are usually 
more equations than unknowns, which permits solutions for several alternative 
models. Each model is compared with the original, empirical correlation matrix 
for goodness of fit. However, several causal models may fit the data about equally 
well (MacCallum, Wegener, Uchino, & Fabrigar, 1993). These statistically equiv- 
alent models may represent different causal paths and hence provide alternative 
explanations for the empirically observed effects. On the basis of the investiga- 
tors knowledge of the situation, he or she must then evaluate the alternative 
models for their plausibility and substantive meaningfulness. f 

Another feature of structural equation modeling is that causal relations are 
typically computed between constructs, rather than between isolated measured 
variables. For instance, to assess a student's attitude toward math, several indica- 
tors could be used, such as measures of interest, goal orientation, self-concept of 
one's math aptitude, and other relevant affective variables. The common vari- 
ance among these indicators would then define a construct of the individual's at- 
titude toward math, which can itself be related to subsequent math performance. 
This use of constructs provides more stable and reliable estimates, in which the 
error and specific variances of the separate indicators cancel out. 

There are currently several methodological approaches to structural equation 
modeling, as well as individual modifications and procedural refinements under 
consideration (see, e:g Anderson & Gerbing, 1988; Bentler, 1990; Bollen & 
Long, 1993; Breckler, 1990; Cole, Maxwell, Arvey, & Salas, 1993; James, 1980; 


*For an introduction to detailed procedure, see Bollen (1989) and Loehlin (1992). The actual computations can 
be carried out by available computer programs, such as LISREL (Hayduk, 1988; Joreskog & Sórbom, 1986, 1989) 
and EQS (Bentler, 1985). 
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Mulaik et al., 1989). While still in a state of development, however, structural 
equation modeling is a promising procedure for combining theoretical, experi- 
mental, and statistical approaches. It has already been widely applied to problems 
in such areas as developmental, personality and social, industrial, and educational 
psychology (e.g., Graves & Powell, 1988; L. A. James & L. R. James, 1989; Mac- 
Callum & Browne, 1993; McCardle, 1989; Parkerson, Lomax, Schiller, & Wal- 
berg, 1984; Shavelson & Bolus, 1982). Efforts at unification and simplification of 
' procedure are also under way (e.g., Jóreskog & Sórbom, 1993). 


Contributions from Cognitive Psychology. The 1970s witnessed a rap- 
prochement of experimental psychology and psychometrics, which is beginning 
to yield significant contributions to an understanding of the constructs assessed by 
tests of intelligence and other broadly defined aptitudes (Ronning, Glover, 
Conoley, & Witt, 1987; R. E. Snow & Lohman, 1989). As early as the 1950s, 
cognitive psychologists had begun to apply the concepts of information processing 
to the study of human problem-solving. Some investigators designed computer 
programs that carry out these processes and thereby simulate human thought. 
Programs can be written to simulate the performance of persons at different levels 
of skill, and with such programs it is possible to predict the number and kinds of 
errors made and the time required for different responses. In designing a program, 
the investigator usually begins with a task analysis that may include introspecting 
about one's own method of solving the problem, having subjects think aloud, or 
using more refined observational procedures. By comparing the performance of 
the computer with that of children and adults, or that of experts and novices, in 
solving the same problem, investigators can test their hypotheses regarding what 
persons do in carrying out the tasks. Examples of the tasks investigated by these 
methods include conventional puzzles, problems in logic, chess games, algebra 
word problems, physics problems, and diagnostic problems in medical practice 
(Chi, Glaser, & Farr, 1988; J. H. Larkin, McDermott, Simon, & Simon, 1980, 
1980b; Newell & Simon, 1972; Simon, 1976). 

The variables identified by these investigations include processes (procedural 
skills) and declarative knowledge (facts and information). The cognitive models 
specify the intellectual processes used to perform the task, the way the processes 
are organized, the relevant knowledge store, and how this knowledge is repre- 
sented in memory and retrieved when needed. Increasing attention is also being 
given to what has been called an executive process or metacognition, which 
refers to the control the individual exercises over her or his own choice of 
processes, representations, and strategies for carrying out the task. In the 1970s, 
some cognitive psychologists began to apply these task-analysis and computer- 
simulation techniques to an exploration of what intelligence tests measure. Indi- 
vidual investigators approached this problem from several different angles (see 
Resnick, 1976; Sternberg, 1981, 1984, 1985b). The resulting research is gradually 
contributing to significant advances in test construction and use. 

The implications of cognitive research for construct validation are especially 
well demonstrated in the work of Embretson (1983, 1986, 19952). Observing that 
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traditional construct validation has been too limited in its approach, Embretson 
proposed two principal aspects of test validation: (1) construct representation and 
(2) nomothetic span. It is on the second, nomothetic span, that traditional con- 
struct-related validation has focused. This is concerned with the relations of test 
performance within a “nomothetic network” of other variables. Such relations are 
generally investigated through correlations of test scores with other measures, 
which may include criterion performance and other real-life data. 

The object of construct representation, on the other hand, is to identify spe- 
cific information-processing components and knowledge stores needed to perform 
the tasks set by the test items. Task decomposition? can be applied experimentally 
to conduct such analyses. Examples of possible procedures include the manipula- 
tion of task complexity, the presentation of partial tasks, or the provision of cues 
that alter the task demands. Mathematical models have been developed to mea- 
sure the contribution of different response components to item performance. An- 
other widely used procedure for cognitive task analysis is protocol analysis 
(Ericsson, 1987; Ericsson & Simon, 1993; van Someren, Barnard, & Sandberg, 
1994). In this procedure, individuals are instructed to “think aloud” as they per- 
form a task or solve a problem. The tasks could range from mentally multiplying 
two given numbers, recalling details of a past event, or locating the cause of an 
equipment malfunction to answering successive items on an ability test. A by- 
product of this procedure is the possible discovery that the same test item may 
evoke quite different cognitive processes in respondents who vary in their experi- 
ential backgrounds. 

What can we conclude regarding the contribution that cognitive psychology 
has made thus far to construct-related validation? While still in an exploratory 
stage, information-processing approaches have contributed heuristic concepts to 
guide further research. They have clearly focused attention on response processes, 
in contrast to the traditional concentration on the end products of thought in 
psychometric research. Analyzing test performance in terms of specific cognitive 
processes should certainly strengthen and expand our understanding of what the 
tests measure. Moreover, analyzing individuals’ performance at the level of ele- 
mentary component processes should eventually make it possible to pinpoint 
each person’s sources of weakness and strength and thereby enhance the diagnos- 
tic use of tests (Embretson, 1987, 1994; Estes, 1974; Pellegrino & Glaser, 1979; 
Sternberg & Weil, 1980). This, in turn, should facilitate the tailoring of training 
programs to the individual’s needs. In summary, the relation between the psycho- 
metric and cognitive approaches can be characterized, first, from the standpoint 
of applied research and practice as complementary. Here, each is the preferred ap- 
proach for different assessment purposes. Second, from the standpoint of basic re- 
search and theory, their relation can be characterized as reciprocal. Here, each 
helps to clarify and enrich the other; in combination, they enhance our under- 
standing of intelligent behavior. ( 


For details, see Butterfield, Nielsen, Tangen, and Richardson (1985), Embretson (1985b), and Sternberg (1977, 
1980). 
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OVERVIEW AND INTEGRATION 


Comparison of Validation Procedures. We have considered several ways 
of asking, “How valid is this test?” To point up the distinctive features of the 
different validation procedures, let us apply each in turn to a test consisting of 50 
assorted arithmetic problems. Four ways in which this test might be employed, 
together with the type of validation procedure appropriate to each, are illustrated 
in Table 5-2. This example highlights the fact that the choice of validation 
procedure depends on the use to be made of test scores. The same test, when 
employed for different purposes, should be validated in different ways. If an 
achievement test is used to predict subsequent performance at a higher edu- 
cational level, as when selecting high school students for college admission, it 
needs to be evaluated against the criterion of subsequent college performance 
rather than in terms of its content validity. 


Inclusiveness of Construct Validation. The examples given in Table 5-2 
focus on the differences among the various types of validation procedures. Further 
consideration of these procedures, however, shows that content-, criterion-, and 
construct-related validities do not correspond to distinct or logically coordinate 
categories. On the contrary, construct-related validity is a comprehensive 
concept, that includes the other types. All the specific techniques for content 
analysis and for measuring criterion relations, which were discussed in earlier 
sections of this chapter, could have been listed again under construct iden- 
tification. For example, the correlations of a mechanical aptitude test with 
performance in shop courses and in a wide variety of jobs contribute to our 
understanding of the construct measured by the test. The construct identification 
would be further strengthened by comparing the scores of contrasted groups of 
successful and unsuccessful workers. 


Validation of a Single Arithmetic Test for Different Purposes 
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Validity against various practical criteria is commonly reported in test manuals 
to aid the potential user in understanding what a test measures. Although he or 
she may not be directly concerned with the prediction of any of the specific crite- 
ria employed, by examining such criteria the test user is able to build up a concept 
of the behavior domain sampled by the test. If we follow this thinking a bit fur- 
ther, we can see that all test use and all interpretation of test scores imply con- 
struct validity, a fact that is being increasingly recognized (J. P. Campbell, 1990a; 
Guion, 1991; Messick, 1980b, 1988, 1989; Tenopyr, 1986). Since tests are rarely, 
if ever, used under conditions that are identical with those under which validity 
data were gathered, some degree of generalizability of results is inevitably in- 
volved. The interpretive meaning of test scores is always based on constructs, 
which may vary widely in breadth or generalizability with regard to behavior do- 
mains, populations, and contexts. 

Messick (1980b, 1989) argued convincingly that the term “validity,” insofar as 
it designates the interpretive meaningfulness of a test, should be reserved for con- 
struct validity. Other procedures with which the term "validity" has been tradi- 
tionally associated should, he maintained, be designated by: more specifically 
descriptive labels. Thus, content validity can be labeled "content relevance" and 
"content coverage," to refer to domain specifications and domain representative- 
ness, respectively. Criterion-related validity can be labeled "predictive utility" 
and “diagnostic utility,” to correspond to predictive and concurrent validation. 
These more clearly descriptive labels undoubtedly contribute to a better under- 
standing of what the various procedures actually accomplish. Nevertheless, the 
distinctions among different types of validation procedures are useful in identify- 
ing tests for particular purposes. They should therefore be reported in test manu- 
als in a clearly identifiable form. 

From another angle, even when the immediate practical application focuses on 
content description (as in. educational assessment) or criterion prediction (as in 
job selection), the use of constructs of appropriate breadth is more effective than 
are measures of specific test performance. Criterion research makes it increasingly 
clear that both criterion measures and test scores can be more effectively ex- 
pressed as matching constructs. Furthermore, the investigation of causal relations 
among constructs, as in structural equation modeling, is coming to be accepted as 
a significant contribution to the understanding of how and why tests work: 


Validation in the Test Construction Process. ltisbeing increasingly rec- 
ognized that the development of a valid test requires multiple procedures, which 
are employed sequentially, at different stages of test construction (Anastasi, 
1986a; Guion, 1991; Jackson, 1970, 1973; N. G. Peterson et al., 1990). Validity is 


thus built into the test from the outset; rather than being limited to the last stages 


of test development, as in traditional, criterion-related validation. The validation 


process begins with the formulation of detailed trait or construct definitions, 


An example of the possible application of these more sophisticated test validation procedures can be found in 


L. A. King and D. W. King (1990). 
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derived from psychological theory, prior research, or systematic observation and 
analyses of the relevant behavior domain. Test items are then prepared to fit the 
construct definitions. Empirical item analyses follow, with the selection of the 
most. effective, or valid, items from the initial item pools. Other appropriate 
internal analyses may then be carried out, including statistical analyses of item 
clusters or subtests. The final stage includes validation of various scores and 
interpretive combinations of scores through statistical analyses against external, 
real-life criteria. 

Almost any information gathered in the process of developing or using a test is 
relevant to its validity. Certainly, data on internal consistency and on retest relia- 
bility help to define the homogeneity of the construct and its temporal stability. 
Norms may well provide additional construct specification, especially if they in- 
clude separate normative data for subgroups classified by age, sex, or other demo- 
graphic variables that affect the individual's experiential history and thereby her 
or his test performance. Moreover, after a test is released for operational use, the 
interpretive meaning of its scores may continue to be clarified and enriched 
through the gradual accumulation of clinical observations and through special re- 
search projects." 


Individual and Social Consequences of Testing. Some psychometricians 
recommend that an additional feature be included in the concept of test validity, 
namely, the ‘consequences of testing on individuals and on society. A notable 
exponent of this broadening of the validity concept is Messick (1980b, 1988, 
1989, 1995). Special emphasis is placed on the unintended consequences of 
specific uses of tests that may be detrimental to individuals or to members of 
certain ethnic or other populations with diverse experiential backgrounds. An 
excellent analysis of the problems of balancing different goals and values in the 
assessment of job applicants is illustrated by the report of a committee of experts 
convened by the National Research Council, which studied the situation with 
b es thoroughness (Hartigan & Wigdor, 1989—see especially chaps. 13 and 
14). 

The ethical and social implications of test use certainly demand widespread 
concern, Some introduction to these problems will be given in chapter 18. More 
technical aspects are treated in chapter 6, in connection with the question of 
“test bias.” However, as noted by other psychometricians (e.g., Cole & Moss, 
1989), to-incorporate these questions in the concept of validity does not appear 
to be the most effective way to meet them. They cannot be answered by means of 
empirical data and statistical analysis alone. Nor should the desired values be 
concealed in statistical manipulations. They need to be explicitly stated and dis- 
cussed as an independent goal, to be considered in addition to the strictly empiri- 
cal and statistically demonstrated validity of the particular test use. A wise 
decision regarding the balancing of conflicting goals is reached by methods appro- 
priate to the handling of value systems (Mullen & Roth, 1991; Zeichmeister & 


"For a successful application of this comprehensive model of test validation, see Elliott (1990b, chap. 9). 
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Johnson, 1992).8 Such methods require human judgment, systematic debate, 
compromises, and conflict resolution, in which the exponents of different value 
systems are adequately represented. To combine the empirical, statistically sup- 
ported procedures of determining validity with an evaluation of the social and 
ethical consequences of particular test uses only confuses and obfuscates the solu- 
tion. 

One conclusion that emerges from a consideration of this difficult and impor- 
tant problem is to reaffirm the major role of the test user, already suggested in 
chapter 1. When value judgments are required, especially in individual cases, one 
more serious responsibility is placed on the test user. In the choice of appropriate 
tests, as well as in the interpretation of results, the test user can control the conse- 
quences of testing. The broad value orientation and social sensitivity of such a 
test user can contribute significantly to the proper use of tests, not only from the 
scientific but also from the ethical standpoint.” 


- Validity; Measurement 
and Interpretation 


hapter 5 was concerned with concepts of validity and sources of validation 
data; this chapter deals with quantitative expressions of validity and their in- 
terpretation. Test users are concerned with validity at either or both of two 
stages. First, when considering the suitability of a test for their purposes, they ex- 
amine available validity data reported in the test manual or other published reports. 
Through such information, they arrive at a tentative concept of what psychological 
functions the test actually measures, and they judge the relevance of such functions 
to their proposed use of the test. In effect, when test users rely on published valida- 
tion data, they are dealing with construct validity, regardless of the specific proce- 
dures used in gathering the data. As we have seen in chapter 5, the criteria employed 
in published studies cannot be assumed to be identical with those the test user wants 
to predict. Jobs bearing the same title in two different companies are rarely identical. 
‘Two courses in freshman English taught in different colleges may be quite dissimilar. 
Hence, some degree of validity generalization is required in test selection. 

Because of the diversity of testing needs and of the inferences to be drawn from 
test scores, some test users may wish to check the validity of any chosen test against 
local criteria. Although published data may strongly suggest that a given test should 
have high validity in a particular situation, direct corroboration is desirable when 
technically feasible. The determination of validity against specific local criteria rep- 
resents the second stage in the test users’ evaluation of validity. The techniques to 
be discussed in this chapter are especially relevant to the analysis of validity data 
obtained by test users themselves. Most of them are also useful, however, in under- 
standing and interpreting the validity data reported in test manuals. 
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VALIDITY COEFFICIENT AND ERROR OF ESTIMATE 


" 
Measurement of Relationship. A validity coefficient is a correlation 
between test score and criterion measure. Because it provides a single numerical 
index of test validity, it is commonly used in test manuals to report the validity of 
a test against each criterion for which data are available. The data used in 
computing any validity coefficient can also, be expressed in the form of an 
expectancy table or expectancy chart, illustrated in chapter 3. In fact, such tables 
and charts provide a convenient way to show what a validity coefficient means for 
the person tested. It will be recalled that expectancy charts give the probability 
that an individual who obtains a certain score on the test will attain a specified 
level of criterion performance. For example, with Table 3-6, if we know a 
student's score on the Numerical Reasoning test of the Differential Aptitude Tests 
(DAT), we can look up the chances that he or she will earn a particular grade in a 
seventh grade mathematics course. The same data yield a validity coefficient of 
.60. When both test and criterion variables are continuous, the familiar Pearson 
Product-Moment Correlation Coefficient is applicable. Other types of correlation 
coefficients can be computed when the data are expressed in different forms, as 
when a twofold pass-fail criterion is employed (e.g., Fig. 3-7). The specific 
procedures for computing these different kinds of correlations can be found in any 
standard statistics text. 


Conditions Affecting, Validity Coefficients. As in the case of reliability, 
it is essential to specify the nature of the group on which a validity coefficient is 
found. The same test may measure different functions when given to individuals 
who differ in age, sex, educational level, occupation, or any other relevant 
characteristic. Persons with different experiential backgrounds, for example, may 
utilize different work methods to solve the same test problem. Consequently, a 
test could have high validity in predicting a particular criterion in one population 
and little or no validity in another. Or it might be a valid measure of different 
functions in the two populations. Tests designed for use with diverse populations 
should cite appropriate. data on population generalizability in their technical 
manuals. Moreover, within a population that varies widely in test scores, the 
validity coefficient may differ appreciably in different parts of the score range and 
should be checked in appropriate subgroups (R. Lee & Foley, 1986). T 

The question of sample heterogeneity is relevant to the measurement of validity, 
as it is to the measurement of reliability, since both characteristics are commonly 
reported in terms of correlation coefficients. It will be recalled that, other things 
being equal, the wider the range of scores, the higher will be the correlation. This 
fact should be kept in mind when interpreting the validity coefficients given in 
test manuals. 

A special difficulty encountered in many validation samples arises from prese- 
lection. For example, a new test that is being validated for job selection may be 
administered to a group of newly hired employees on whom criterion measures 
of job performance will eventually, be available. It is likely, however, that such 
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employees represent a superior selection of all those who applied for the job. 
Hence, the range of such a group in both test scóttes and criterion measures will be 
curtailed at the lower end of the distribution. The effect of such preselection will 
therefore be to lower the validity coefficient. In the subsequent use of the test, 
when it is administered to all applicants for selection purposes, the validity can be 
expected to be somewhat higher. 

Validity coefficients may also change over time because of changing selection 
standards. An example is provided by a comparison of validity coefficients com- 
puted over a 30-year interval with Yale students (Burnham, 1965). Correlation- 
were found between a predictive index based on College Board tests and higl 
school records, on the one hand, and average freshman grades, on the other. Thi 
correlation dropped from .71 to .52 over the 30 years. An examination of the bi 
variate distributions clearly revealed the reason for this drop. Because of highe: 
admission standards, the later class was more homogeneous than the earlier class 
in both predictor and criterion performance. Consequently, the correlation was 
lower in the later group, although the accuracy with which individuals’ grades 
were predicted showed little change. In other words, the observed drop in correla- 
tion did not indicate that the predictors were less valid than they had been 30 
years earlier. Had the differences in group homogeneity been ignored, it might 
have been wrongly concluded that this was the case. 

For the proper interpretation of a validity coefficient, attention should also be 
given to the form of the relationship between test and criterion. The computation 
of a Pearson correlation coefficient assumes that the relationship is linear and 
uniform throughout the range. Research on the relationship of test scores to job 
performance has shown that these conditions are generally met (Coward & Sack- 
ett, 1990; Hawk, 1970). Nevertheless, special circumstances may alter this rela- 
tionship, and the test user should be alert to such possibilities. For example, a 
particular job may require a minimum level of. reading comprehension, to enable 
employees to read instruction manuals, labels, and the like. Once this minimum 
is exceeded, however, further increments in reading ability may be unrelated to 
degree of job success. This would be an example of a nonlinear relation between 
test and job performance. An examination of the bivariate distribution or scatter 
diagram obtained by plotting reading comprehension scores against criterion 
measures would show a rise in job performance up to the minimal required read- 
ing ability and a leveling off beyond that point. Hence, the entries would cluster 
around a curve rather than a straight line. 

In other situations, the line of best fit may be a straight line, but the individual 
entries may deviate further around this line at the upper than at the lower end of 
the scale. Suppose that performance on a scholastic aptitude test is a necessary 
but not a sufficient condition for successful achievement in a course. That is, the 
low-scoring students will perform poorly in the course; but among the high- 
scoring students, some will perform well in the course and others will perform 
poorly because of low motivation, lack of interest, or other adverse conditions. In 
this situation, there will be wider variability of criterion performance among the 
high-scoring than among the low-scoring students. This condition in a bivariate 
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distribution is known as heteroscedasticity. The Pearson correlation assumes 
homoscedasticity or equal variability throughout the range of the bivariate distri- 
bution. In the present example, the bivariate distribution would be fan-shaped— 
wide at the upper end and narrow at the lower end. An examination of the 
bivariate distribution itself will usually give a good indication of the nature of the 
relationship between test and criterion. Expectancy tables and expectancy charts 
also correctly reveal the relative effectiveness of the test at different levels. 


Magnitude of a Validity Coefficient. How high should a validity coeffi- 
cient be? No general answer to this question is possible, since the interpretation 
of a validity coefficient must take into account a number of concomitant circum- 
stances. The obtained correlation, of course, should be high enough to be statisti- 
cally significant at some acceptable level, such as the .01 or .05 levels discussed in 
chapter 4. In other words, before drawing any conclusions about the validity of a 
test, we should be reasonably certain that the obtained validity coefficient could 
not have arisen through chance fluctuations of sampling from a population 
correlation of zero. ; 

Having established a significant correlation between test scores and criterion, 
however, we need to evaluate the size of the correlation in the light of the uses to 
be made of the test. If we wish to predict an individual's exact criterion score, 
such as the grade-point average a student will receive in college, the validity coef- 
ficient may be interpreted in terms of the standard error of estimate (SE,,.), which 
is analogous to the error of measurement discussed in connection with reliability. 
It will be recalled that the error of measurement indicates the margin of error to 
be expected in an individual's score as a result of the unreliability of the test. Sim- 
ilarly, the error of estimate shows the margin of error to be expected in the indi- 
vidual’s predicted criterion score, as a result of the imperfect validity of the test. 

The error of estimate is found by the following formula: 


SE, = SD, VI = n? 


in which rą? is the square of the validity coefficient and SD, is the standard de- 
viation of the criterion scores. It will be noted that if the validity were perfect 
(r = 1.00), the error of estimate would be zero. On the other hand, with a test 
having zero validity, the error of estimate is as large as the standard deviation 
of the criterion distribution (SE, = SD, VI — 0 2 SD;). Under these conditions, 
the prediction is no better than a guess; and the range of prediction error is as 
wide as the entire distribution of criterion scores. Between these two ex- 
tremes are to be found the errors of estimate corresponding to tests of varying 
validity. 

Kg to the formula for SE will show that term V1 — re? serves to indi- 
cate the size of the error relative to the error that would result from a mere guess (i.e., 
with zero validity). In other words, if V1 — rs is equal to 1.00, the error of esti- 
mate is as large as it would be if we were to guess the individual's criterion score. 
The predictive improvement attributable to the use of the test would thus be nil. 
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If the validity coefficient is .80, then puo is equal to .60, and the error is 
60% as large as it would be by chance. To put it differently, the use of such a test 
enables us to predict the individual's criterion performance with a margin of error 
that is 4096 smaller than it would be if we were to guess. 

It would thus appear that even with a validity of .80, which is unusually high, 
the error of predicted scores is considerable. If the primary function of psycholog- 
ical tests were to predict each individual's exact position in the criterion distribu- 
tion, the outlook would be quite discouraging. When examined in the light of the 
error of estimate, most tests do not appear very efficient. In most testing situa- 
tions, however, it is not necessary to predict the specific criterion performance of 
individual cases, but rather to determine which individuals will exceed a certain 
minimum standard of performance, or cutoff point, in the criterion. What are the 
chances that Mary Greene will graduate from medical school, that Tom Higgins 
will pass a course in calculus, or that Beverly Bruce will succeed as an astronaut? 
Which applicants are likely to be satisfactory clerks, insurance agents, or machine 
operators? Such information is useful not only for group selection but also for in- 
dividual career planning. For example, it is advantageous for a student to know 
that she has a good chance of passing all courses in law school, even if we are un- 
able to estimate with certainty whether her grade average will be 74 or 81. 

A test may appreciably improve predictive efficiency if it shows any significant 
correlation with the criterion, however low. Under certain circumstances, even va- 
lidities as low as .20 or .30 may justify inclusion of the test in a selection program. For 
many testing purposes, evaluation of tests in terms of the error of estimate is unrealis- 
tically stringent. Consideration must be given to other ways of evaluating the contri- 
bution of a test, ways that take into account the types of decisions to be made from 
the scores. Some of these procedures will be illustrated in the following section. 


TEST VALIDITY AND DECISION THEORY 


Basic Approach. Let us suppose that 100 applicants were given an aptitude 
test and were followed up until each could be evaluated for success on a certain 
job. Figure 6-1 shows the bivariate distribution of test scores and measures of job 
success for the 100 employees. The correlation between these two variables is 
slightly below .70. The minimum acceptable job performance, or criterion cutoff 
point, is indicated in the diagram by a heavy horizontal line. The 40 cases falling 
below this line would represent job failures; the 60 above the line, job successes. If 
all 100 applicants are hired, therefore, 6096 will succeed on the job. Similarly, if a 
smaller number were hired at random, without reference to test scores, the 
proportion of successes would probably be close to 6096. Suppose, however, that 
the test scores are used to select the 45 most. promising applicants out of the 100 
(selection ratio — .45). In such a case, the 45 individuals falling to the right of the 
heavy vertical line would be chosen. Within this group of 45, it can be seen that 
there are 7 job failures, or false acceptances, falling below the heavy horizontal 
line, and 38 job successes. Hence, the percentage of job successes is now 84 rather 
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Figure 6-1. Increase in the 
Selection Test. tf 
Ens TABE adt, bast os wore ob ota e aen, 
than 60 (i.e., 38/45 =.84). This increase is attributable to the use of the test as a 
screening instrument. It will be noted that errors in predicted criterion score that 
do not affect the decision can be ignored. Only those prediction errors that cross 
the cutoff line and hence place the individual in the wrong category will reduce 
the selective effectiveness of the rest = a SET 
For a complete evaluation of the effectiveness of the test as a screening instru- 
ment, another category of cases in Figure 6-1 must also be examined. This is the . 
category of false rejections, comprising the 22 persons who score below the cutoff 
point on the test but above the criterion cutoff. From these data, we would esti- 
mate that 22% of the total applicant sample are potential job successes who will 
be lost if the test is used as a screening device with the present cutoff point. In set- 
ting a cutoff score on a test, attention should be given to the percentage of false 
rejects, as well as to the percentages of successes and failures within the selected 
group. In certain situations, the cutoff point should be set sufficiently high to ex- 
clude all but a few possible failures. This would be the case when the job is of such » 
a nature that a poorly qualified worker could cause serious loss or damage. An ex- 
ample would be à commercial airplane pilot. Under other circumstances, it may 
more important to admit as many qualified persons as possible, at the risk of in- 
cluding more failures. In the latter case, the number of false rejects can be reduced 
by the choice of a lower cutoff score. Other factors that normally determine the 
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position of the cutoff score include the available personnel supply, the number of 
job openings, and the urgency or speed with which the openings must be filled.! 

In many personnel decisions, the selection ratio is determined by the practical 
demands of the situation. Because of supply and demand in filling job openings, 
for example, it may be necessary to hire the top 4096 of applicants in one case and 
the top 7596 in another. When the selection ratio is not externally imposed, the 
cutting score on a test can be set at that point giving the maximum differentia- 
tion between criterion groups. This can be done roughly by comparing the distri- 
bution of test scores in the two criterion groups. More precise mathematical 
procedures for setting optimal cutting scores have also been worked out (Darling- 
ton & Stauffer, 1966; I. Guttman & Raju, 1965; Jaeger, 1989; Livingston & Zieky, 
1982; Martin & Raju, 1992; Rorer, Hoffman, & Hsieh, 1966). These procedures 
make it possible to take into account other relevant parameters, such as the rela- 
tive seriousness of false rejections and false acceptances. Insofar as such evalua- 
tions are incorporated in the process, however, human judgment needs to be 
involved at some point. 

In the terminology of decision theory, the example given in Figure 6-1 illus- 
trates a simple strategy, or plan for deciding which applicants to accept and which 
to reject. In more general terms, a strategy is a technique for utilizing information 
in order to reach a decision about individuals. In this case, the strategy was to ac- 
cept the 45 persons with the highest test scores. The increase in percentage of 
successful employees from 60 to 84 could be used as a basis for estimating the net 
benefit resulting from the use of the test. 

Statistical decision theory was developed by Wald (1950) with special refer- 
ence to the decisions required in the inspection and quality control of industrial 
products. Many of its implications for the construction and interpretation of psy- 
chological tests have been systematically worked out by Cronbach and Gleser 
(1965). Essentially, decision theory is an attempt to put the decision-making 
process into mathematical form, so that available information may be used to ar- 
rive at the most effective decision under specified circumstances. Some of the 
basic concepts of decision theory are proving helpful in the reformulation and 
clarification of certain questions about tests. A few of these ideas were introduced 
into testing before the formal development of statistical decision theory and were 
later recognized as fitting into that framework. 


Prediction of Outcomes. A precursor of decision theory in psychological 
testing is to be found in the Taylor-Russell tables (1939), which permit a determi- 
nation of the net gain in the selection accuracy attributable to the use of a test. 
The information required includes the validity coefficient of the test, the propor- 
tion of applicants who must be accepted (selection ratio), and the proportion of 
successful applicants selected without the use of a test (base rate). A change in 
any of these three conditions can alter the predictive efficiency of the test. 


‘Similar points were made from a different angle in the introductory discussion of cutoff scores in chapter 3. 
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For purposes of illustration, one of the Taylor-Russell tables has been repro- 
duced in Table 6-1. This table is designed for use when the base rate, or percent- 
age of successful applicants selected prior to the use of the test, is 60. Other tables 
are provided by Taylor and Russell for other base rates. Across the top of the table 
are given different values of the selection ratio, and along the side are the test va- 
lidities. The entries in the body of the table indicate the proportion of successful 
persons selected after the use of the test. Thus, the difference between .60 and any 
one table entry shows the increase in proportion of successful selections attribut- 
able to the test. 

Obviously if the selection ratio were 10096—that is, if all applicants had to be ac- 
cepted—no test, however valid, could improve the selection process. Reference to 


Table 6-1 


Proportion of "Successes" Expected Through the Use of Test of 
Given Validity and Given Selection Ratio, for Base Rate .60. 


Selection Ratio 


Validity .05 .10 .20 .30 — .40 50 — 20 .70 .80 .90 .95 
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Note. A full set of tables can be found in H. C. Taylor and Russell (1939) and in McCormick and Ilgen (1980, 
Appendix B). 
(From H. C. Taylor & Russell, 1939, p. 576.) 
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Table 6-1 shows that, when as many as 95% of applicants must be admitted, even a 
test with perfect validity (r = 1.00) would raise the proportion of successful persons 
by only 3% (.60 to .63). On the other hand, when only 5% of applicants need to be 
chosen, a test with a validity coefficient of only .30 can raise the percentage of suc- 
cessful applicants selected from 60 to 82. The rise from 60 to 82 represents the incre- 
mental validity of the test (Sechrest, 1963), or the increase in predictive validity 
attributable to the test. It indicates the contribution the test makes to the selection 
of individuals who will meet the minimum standards in criterion performance. In 
applying the Taylor-Russell tables, of course, test validity should be computed on 
the same sort of group used to estimate percentage of prior successes. In other words, 
the contribution of the test is not evaluated against chance success unless appli- 
cants were previously selected by chance—a most unlikely circumstance. If appli- 
cants had been selected on the basis of previous job history, letters of recommen- 
dation, and interviews, the contribution of the test should be evaluated on the basis 
of what the test adds to these previous selection procedures. 

The incremental validity resulting from the use of a test depends not only on 
the selection ratio but also on the base rate. In the previously illustrated job selec- 
tion situation, the base rate refers to the proportion of successful employees prior 
to the introduction of the test for selection purposes. Table 6-1 shows the antici- 
pated outcomes when the base rate is .60. For other base rates, we need to consult 
the other appropriate tables in the cited reference (H. C. Taylor & Russell, 1939). 
Let us consider an example in which test validity is .60 and the selection ratio is 
4096. Under these conditions, what would be the contribution of incremental va- 
lidity of the test if we begin with a base rate of 5096? And what would be the con- 
tribution if we begin with more extreme base rates of 1096 and 9096? Reference to 
the appropriate Taylor-Russell tables for these base rates shows that the percent- 
age of successful employees would rise from 50 to 75 in the first case; from 10 to 21 
in the second; and from 90 to 99 in the third. Thus, the improvement in percent- 
age of successful employees attributable to the use of the test is 25 when the base 
rate was 50, but only 11 and 9 when the base rates were more extreme. 

The implications of extreme base rates are of special interest in clinical psy- 
chology, where the base rate refers to the frequency of the pathological condition 
to be diagnosed in the population tested (Buchwald, 1965; Cureton, 19572; 
Meehl & Rosen, 1955; J. S. Wiggins, 1973/1988). For example, if 5% of the in- 
take population of a clinic has organic brain damage, then 596 is the base rate of 
brain damage in this population. Although the introduction of any valid test will 
improve predictive or diagnostic accuracy, the improvement is greatest when the 
base rates are closest to 5096. With the extreme base rates found with rare patho- 
logical conditions, however, the improvement may be negligible. Under these 
conditions, the use of a test may prove to be unjustified when the cost of its ad- 
ministration and scoring is taken into account. In a clinical situation, this cost 
would include the time of professional personnel that might otherwise be spent 
on the treatment of additional cases (Buchwald, 1965). The number of false posi- 

tives, or normal individuals incorrectly classified as pathological, would of course 
increase this overall cost in a clinical situation. 
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When the seriousness of a rare condition makes its diagnosis urgent, tests of 
moderate validity may be employed in an early stage of sequential decisions. For 
example, all cases might first be screened with an easily administered test of mod- 
erate validity. If the cutoff score is set high enough (high scores being favorable), 
there will be few false negatives but many false positives, or normals diagnosed as 
pathological. The latter can then be detected through a more intensive individ- 
ual examination given to all cases diagnosed as positive by the test. This solution 
would be appropriate, for instance, when available facilities make the intensive 
individual examination of all cases impracticable. 


Relation of Validity to Productivity. In many practical situations, what is 
wanted is an estimate of the effect of the selection test, not on percentage of 
persons exceeding the minimum performance, but on overall productivity of the 
selected persons. How does the actual level of job proficiency or criterion 
achievement of the workers hired on the basis of the test compare with that of the 
total applicant sample that would have been hired without the test? Following 
the work of Taylor and Russell, several investigators addressed themselves to this 
question. Brogden (1946b) first demonstrated that the expected increase in 
output is directly proportional to the validity of the test. Thus, the improvement 
resulting from the use of a test of validity .50 is 5096 as great as the improvement 
expected from a test of perfect validity. 

The relation between test validity and expected rise in criterion achievement 
can be readily seen in Table 6-2.” Expressing criterion scores as standard scores 
with a mean of zero and an SD of 1.00, this table gives the expected mean crite- 
rion score of workers selected with a test of given validity and with a given selec- 
tion ratio. In this context, the base output mean, corresponding to the 
performance of applicants selected without use of the test, is given in the column 
for zero validity. Using a test with zero validity is equivalent to using no test at all. 
To illustrate the use of the table, let us assume that the highest-scoring 20% of the 
applicants are hired (selection ratio = .20) by means of a test whose validity coef- 
ficient is .50. Reference to Table 6-2 shows that the mean criterion performance 
of this group is .70 SD above the expected base mean of an untested sample. With 
the same 20% selection ratio and a perfect test (validity coefficient = 1.00), the 
mean criterion score of the accepted applicants would be 1.40, just twice what it 
would be with the test of validity .50. Similar direct linear relations will be found 
if other mean criterion performances are compared within any row of Table 6-2. 
For instance, with a selection ratio of 60%, a validity of .25 yields a mean crite- 
tion score of .16, while a validity of .50 yields a mean of .32. Again, doubling the 
validity doubles the output rise. ; 

The analysis of productivity in relation to test validity was carried further by 
Schmidt and his associates (Schmidt, Hunter, McKenzie, & Muldrow, 1979). Using 
the job of computer programmer in the federal government as an illustration, 


7A table including more values for both selection ratios and validity coefficients was prepared by Naylor and 
Shine (1965). 
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these investigators estimated the dollar value of the productivity increase resulting 
from one year's use of a computer aptitude test (validity — .76) in selecting new 
hires. They arrived at their estimates through the application of decision- 
theoretic techniques to data available in the U.S. Office of Personnel Management. 
Expected gains were computed for nine selection ratios ranging from .05 to .80, and 
for five validity coefficients of prior selection procedures ranging from zero (random 
selection) to .50. 

The results indicated impressive gains in productivity from the use of the test 
under all these conditions. When use of the test was compared with random se- 
lection, the dollar gain ranged from $97.2 million for a selection ratio of .05 to 
$16.5 million for a selection ratio of .80. With prior selection validity of .50, the 
corresponding gains ranged from $33.3 million to $5.6 million. These gains would 
be spread over the expected tenure of the newly hired employees, which for com- 
puter programmers in the federal government averaged slightly under 10 years. It 
should also be noted that the estimates are based on the assumption that selection 
proceeds from the top-scoring applicants downward, until the specified selection 
ratio is reached. In other words, the procedure assumes optimum use of the selec- 
tion process. 

Using census data to assess the number of persons employed as computer pro- 
grammers in the entire population of the United States, the investigators also 
worked out corresponding estimates of the effect of using the given test on the na- 
tional level. In a still broader subsequent study, Hunter and Schmidt (1981) ex- 
plored the possible application of the same statistical techniques to the entire 
national work force, across all occupations. These preliminary estimates are ad- 
mittedly crude and tentative, and alternative procedures for such analyses tend to 
yield lower estimates (Burke & Frederick, 1984; U.S. Department of Labor, 
1983b; Weekley, Frank, O'Connor, & Peters, 1985). Nevertheless, available re- 
sults strongly suggest that effective methods for allocating people to jobs can con- 
tribute substantially to national productivity. The nature of work productivity, as 
well as the individual and organizational conditions that affect it, are receiving 
increasing attention. This is a growing area of criterion research in test valida- 
tion, which is showing noticeable theoretical and methodological advances 
(J. P. Campbell, Campbell, & Associates, 1988; Hunter, Schmidt, & Judiesch, 
1990; Raju, Burke, & Normand, 1990). 


The Concept of Utility in Decision Theory. It is characteristic of deci- 
sion theory that tests are evaluated in terms of their effectiveness in a particular 
situation. Such evaluation takes into account not only the validity of the test in 
predicting a particular criterion but also a number of other parameters, including 
base rate and selection ratio. Another important parameter is the relative utility of 
expected outcomes, the judged favorableness or unfavorableness of each outcome. 
The lack of adequate systems for assigning values to outcomes in terms of a 
uniform utility scale has been one of the chief obstacles to the application of 
decision theory. In industrial decisions, a dollar value can frequently be assigned 
to different outcomes. Even in such cases, however, certain outcomes pertaining 
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to good will, public relations, and employee morale are difficult to assess in 
monetary terms. Educational decisions must take into account institutional goals, 
social values, and other relatively intangible factors. Individual decisions, as in 
counseling, must consider the individual's preferences and value system. It has 
been repeatedly pointed out, however, that decision theory did not introduce the 
problem of values into the decision process but merely made it explicit. Value 
systems have always entered into decisions, but they were not heretofore clearly 
recognized or systematically handled. 

Advances in procedures for the assignment of values in decision models are il- 
lustrated in the productivity research by Schmidt, Hunter, and their associates 
cited in the preceding section. Although concerned with the dollar value of the 
goods and services provided by workers, the techniques developed in that research 
are applicable to the measurement of other values. Being based on the quantifica- 
tion of human judgments, the same procedures can be used with any arbitrary nu- 
merical scale, provided that the scale is clearly defined and consistently applied. It 
should be noted that the estimates required by decision models pertain only to the 
relative, not the absolute, values of different outcomes. A comprehensive techni- 
cal treatment of utility in personnel decisions can be found in Boudreau (1991).3 

In choosing a decision strategy, the goal is to maximize expected utilities across 
all outcomes. Reference to the schematic representation of a simple decision 
strategy in Figure 6-2 will help to clarify the procedure. This diagram shows the 
decision strategy illustrated in Figure 6-1, in which a single test is administered to 
a group of applicants and the decision to accept or reject an applicant is made on 
the basis of a cutoff score on the test. There are four possible outcomes, including 
valid and false acceptances and valid and false tejections. The probability of each 
outcome can be found from the number of persons in each of the four sections of 
Figure 6-1. Since there were 100 applicants in that example, these numbers di- 
vided by 100 give the probabilities of the four outcomes listed in Figure 62. 

The other data needed are the utilities of the different outcomes, expressed on 
a common scale. These hypothetical values, by whatever judgmental procedure 
obtained, are given in the last column of Figure 6-2. The expected overall utility 
of the strategy can be found by multiplying the probability of each outcome by 
the utility of the outcome, adding these products for the four outcomes, and sub- 
tracting a value corresponding to the cost of testing. This last term highlights the 
fact that a test of low validity is more likely to be retained if it is short, inexpen- 
sive, easily administered by relatively untrained personnel, and suitable for group 
administration. An individual test requiring a trained examiner or expensive 
equipment would need a higher validity to justify its use. In the hypothetical ex- 
ample illustrated in Figure 6—2, cost of testing is estimated as .10 on the utility 
scale. The total expected utility (EU) for this decision strategy is: 


EU = (.38)(1.00) + (.07)(-1.00) + (.33)(0) + (.22)(—.50) = 10 = +.10 


>For other perspectives on the assessment of utility, see Cascio and Morris (1990), Messick (1989, pp. 78-81), 
and Sadacca, Campbell, Difazio, Schultz, and White (1990). 
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Figure 6-2. A Simple Decision Strategy. 


This EU can then be compared with other EUs found with different cutoff points, 
with different tests (differing in validity and cost), or with a battery of tests, as 
well as with different decision strategies. ] 


Sequential Strategies and Adaptive Treatments. In some situations, the 
effectiveness of a test may be increased through the use of more complex decision 
strategies that take still more parameters into account. Two examples will serve to 
illustrate these possibilities. First, tests may be used to make sequential rather than 
terminal decisions. With the simple decision strategy illustrated in Figures 6-1 
and 6-2, all decisions to accept or reject are treated as terminal. Figure 6-3, on 
the other hand, shows a two-stage sequential decision. Test A could be a short 
and easily administered screening test. On the basis of performance on this test, 
individuals would be sorted into three categories, including those clearly accepted 
or rejected, as well as an intermediate “uncertain” group to be examined further 
with more intensive techniques, represented by Test B. On the basis of the 
second-stage testing, this group would be sorted into accepted and rejected 
categories. : 

Another strategy, suitable for the diagnosis of psychological disorders, is to use 
only two categories, but to test further all cases classified as positives (i.e., possibly 


4Examples of several decision strategies, showing all computational steps, can be found in J. S. Wiggins 
(1973/1988, chap. 6). X 
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Figure 6-3. A Sequential Decision Strategy. 


pathological) by the preliminary screening test. This is the strategy cited earlier 

_ in this section, in connection with the use of tests to diagnose pathological condi- 
tions with very low base rates. 

It should also be noted that many personnel decisions are in effect sequential, 


dents can be dropped from college at several stages. In such situations, it is only 
adverse selection decisions that are terminal. To be sure, incorrect selection deci- 
sions that are later rectified may be costly in terms of several value systems. But 


training procedures for workers at different aptitude levels, or the introduction of 
remedial instruction programs for students with certain educational disabilities. 
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The examples cited illustrate a few of the ways in which the concepts and ra- 
tionale of decision theory can assist in the evaluation of psychological tests for 
specific testing purposes. Essentially, decision theory has served to focus attention 
on the complexity of factors that determine the contribution a given test can 
make in a particular situation. The validity coefficient alone cannot indicate 
whether or not a test should be used, since it is only one of the factors to be con- 
sidered in evaluating the impact of the test on the efficacy of the total decision 
process. 


Moderator Variables. The validity of a test for a given criterion may vary 
among subgroups differing in personal characteristics. The classic psychometric 
model assumes that prediction errors are characteristic of the test rather than of 
the person and that these errors are randomly distributed among persons. The 
flexibility of approach ushered in by decision theory stimulated some exploration 
of prediction models involving interaction between persons and. tests. Such 
interaction would imply that the same test may be a better predictor for certain 
classes or subsets of persons than it is for others. For example, a given test may be 
a better predictor of criterion performance for men than for women, op a better 
predictor for applicants from a lower than for applicants from a higher socioeco- 
nomic level. In these examples, sex and socioeconomic level are known as moder- 
ator variables, since they moderate the validity of the test (Saunders, 1956). 

Interests and motivation may function as moderator variables. Thus, if appli- 
cants have little interest in a job, they will probably perform poorly regardless of 
their scores on relevant aptitude tests. Among such persons, the correlation be- 
tween aptitude test scores and job performance would be low. For individuals who 
are interested and highly motivated, on the other hand, the correlation between 
aptitude test score and job success may be quite high. The 1950s and 1960s wit- 
nessed a flurry of research on a wide variety of possible moderator variables. A se- 
ries of studies by Ghiselli (1956, 1960; 1963, 1968) were concerned with the 
prediction of job performance. Other investigators tested hypotheses regarding 
the role of personality variables, particularly in the prediction of college achieve- 
ment (N. J. Frederiksen & Gilbert, 1960; N. Frederiksen & Melville, 1954; 
Grooms & Endler, 1960; L. J. Stricker, 1966). 

One relatively consistent finding was a sex difference in the predictability of 
academic grades. Surveys covering several hundred correlation coefficients from 
many sources reported higher correlations for women than for men between apti- 
tude test scores and grades (Gross, Faggen, & McCarthy, 1974; Schmitt, Mellon, 
& Bylenga, 1978; Seashore, 1962). The same trend was found in high school and 
college, although the trend was more pronounced at the college level. The data 
do not indicate the reason for this sex difference in the predictability of academic 
achievement, but it may be interesting to speculate about it in the light of other 
known sex differences. If women students in general tend to be more conforming 


For a fuller discussion of the implications of decision theory for test use, see J. S. Wiggins (1973/1988, chap. 6) 
and, at a more technical level, Cronbach and Gleser (1965). 
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and more inclined to accept the values and standards of the school situation, 
their class achievement will probably depend largely on their abilities. If, on the 
other hand, men students tend to concentrate their efforts on those activities (in 
or out of school) that arouse their individual interests, these interest differences 
would introduce additional variance in their course achievement and would 
make it more difficult to predict achievement from aptitude test scores. It should 
be noted, however, that the sex differences in these validity coefficients, although 
fairly consistent, were generally small. Moreover, there is some indication that 
the differences tended to be smaller in the later studies, a finding that could re- 
flect changing attitudes of women in the late 1960s and the 1970s. 

In general, early expectations about the contribution. of moderator variables 
were not fulfilled (Abrahams & Alf, 1972; Pinder, 1973; Zedeck, 1971). Method- 
ological analyses revealed many pitfalls. Cross-validation in new samples fre- 
quently failed to corroborate initial findings. And it did not appear likely that the 
use of moderators would substantially improve the prediction that could be 
achieved through other means. In the light of present knowledge, no variable can 
be assumed to moderate validities in the absence of explicit evidence for such an 
effect. The concept of moderator variables may nevertheless have heuristic value 
in furthering the understanding of individual behavior, as in clinical case studies, 
and in suggesting fresh hypotheses that should be investigated with proper 
methodological controls. In fact, the 1980s and 1990s witnessed a revival of inter- 
est in moderator variables. Although undetected by. earlier procedures, some 
moderator variables can now be identified through more sophisticated statistical 
analyses (Morris, Sherman, & Mansfield, 1986; E. E Stone & Hollenbeck, 1989). 


COMBINING INFORMATION FROM DIFFERENT TESTS 


For the prediction of practical criteria, not one but several tests may often be re- 
quired. Most criteria are complex, the criterion measure depending on a number 
of different traits. A single test designed to measure such a criterion would thus 
have to be highly heterogeneous. It has already been pointed out, however, that a 
relatively homogeneous test, measuring largely a single trait, is more satisfactory 
because it yields less ambiguous scores (chap. 5). Hence, it is often preferable to 
use a combination of several relatively homogeneous tests, each covering a differ- 
ent aspect of the criterion, rather than a single test consisting of a hodgepodge of 
many different kinds of items. 

When a number of specially selected tests are employed together to predict a 
single criterion, they are known as a test battery. The chief problem arising in the 
use of such batteries concerns the way in which scores on the different tests are to 
be combined in arriving at a decision regarding each individual. The procedures 
followed for this purpose are of two major types, namely, multiple regression equa- 
tion and profile analysis. When tests are administered in the intensive study of in- 
dividual cases, as in clinical diagnosis, counseling, or the evaluation of high-level 
executives, it is a common practice forthe examiner to utilize test scores without 
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further statistical analysis. In preparing a case report and in making recommenda- 
tions, the examiner relies on judgment, past experience, and theoretical rationale 
to interpret score patterns and integrate findings from different tests. 


Multiple Regression Equation. The multiple regression equation yields. a 
predicted criterion score for each individual on the basis of her or his scores on all 
tests in the battery. The following regression equation illustrates the application 
of this technique to predicting a student's achievement in high school mathemat- 
ics courses from her or his scores on verbal (V), numerical (N), and reasoning (R) 
tests: 


Mathematics Achievement = .21V  .21N + .32R + 1.35 


In this example, the test scores and the criterion score are expressed as stanines. 

Any other score scale could be used for this purpose. In the above equation, the 

student's stanine score on each of the three tests is multiplied by the correspond- 

ing weight given in the equation. The sum of these products, plus a constant 

(1.35), gives the student's predicted stanine position in mathematics courses. 
Suppose that Betty Jones receives the following stanine scores: 


Verbal 6 
Numerical 4 
Reasoning 8 


The estimated mathematics achievement of this student is found as follows: 
Math. Achiev. = (.21)(6) + (.21)(4) + (.32)(8) 4.1.35 = 6.01 


Betty’s predicted stanine is approximately 6. It will be recalled (chap. 3) that a 
stanine of 5 represents average performance. Betty would thus be expected to do 
somewhat better than average in mathematics courses. Her very superior perfor- 
mance in the reasoning test (R= 8) and her above-average score on the verbal 
test (V. = 6) compensate for her poor score in speed and accuracy of computation 
(N 24). 

Specific techniques for the computation of regression equations can be found 
in texts on psychological statistics (e.g, D: C» Howell, 1997; Runyon & Haber, 
1991). Essentially, such-an:equation is based on the correlation of each test with 
the criterion, as well as on the intercorrelations among the tests. Obviously, those 
tests that correlate higher with the criterion should receive more weight. lt is 
equally important, however, to take into account the correlation of each test with 
the other tests in the battery. Tests correlating highly with one another represent 
needless duplication, since they cover to a large extent the same aspects of the 
criterion. The inclusion of two such tests will not appreciably increase the validity 
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of the entire battery, even though both tests may correlate highly with the crite- 
rion. In such a case, one of the tests would serve about as effectively as the pair; 
only one would therefore be retained in the battery. 

Even after the most serious instances of duplication have been eliminated, 
however, the tests remaining in the battery will correlate with one another to 
varying degrees. For maximum predictive value, tests that make a more nearly 
unique contribution to the total battery should receive greater weight than those 
that partly duplicate the functions of other tests. In the computation of a multiple 
regression. equation, each test is weighted in direct proportion to its correlation 
with the criterion and in inverse proportion to its correlations with the other 
tests. Thus, the highest weight will be assigned to the test with the highest valid- 
ity and the least amount of overlap with the rest of the battery. 

The validity of the entire battery can be found by computing the multiple cor- 
relation (R) between the criterion and the battery. This correlation indicates the 
highest predictive value that can be obtained from the given battery, when each 
test is given optimum weight for predicting the criterion in question. The opti- 
mum weights are those determined by the regression equation. 

It should be noted that these weights are optimum only for the particular sam- 
ple in which they were found. Because of chance errors in the correlation coeffi- 
cients used in deriving them, the regression weights may vary from sample to 
sample. Hence, the battery should be cross-validated by correlating the predicted 
criterion scores with the actual criterion scores in a new sample. Formulas are 
available for estimating the amount of shrinkage in a multiple correlation to be ex- 
pected when the regression equation is applied to a second sample, but empirical 
verification is preferable whenever possible. The larger the sample on which re- 
gression weights were derived, the smaller the shrinkage will be. 

In certain situations, the predictive validity of a battery may be improved by 
including in the regression equation a test having a zero correlation with the cri- 
terion but a high correlation with another test in the battery. This curious situa- 
tion arises when the test that is uncorrelated with the criterion acts as a suppressor 
variable to eliminate or suppress the irrelevant variance in the other test (Conger 
& Jackson, 1972). For example, reading comprehension might correlate highly 
with scores on a mathematical or a mechanical aptitude test, because the test 
problems require the understanding of complicated written instructions. If read- 
ing comprehension is irrelevant to the job behavior to be predicted, the reading 
comprehension required by the tests introduces error variance and lowers the pre- 
dictive validity of the tests. Administering a reading comprehension test and in- 
cluding scores on this test in the regression equation will eliminate this error 
variance and raise the validity of the battery. The suppressor variable appears in 
the regression equation with a negative weight. Thus, the higher an individual's 
score on reading comprehension, the more is deducted from his or her score on 
the mathematical or mechanical test. In any situation, however, the more direct 


Under certain conditions, unit weights or other alternatives may be preferable to the regression weights. For a 
brief overview of research on various weighting schemes, see Dunnette and Borman (1979). 


Validity: Measurement and Interpretation 173 


procedure of revising a test to eliminate the irrelevant variance is preferable to 
the indirect statistical elimination of such variance through a suppressor variable. 
When changes in the test are not feasible, the investigation of suppressor vari- 
ables should be considered. In such cases, the effect of the suppressor variable 
should always be checked-in a new sample. 


Profile Analysis and Cutoff Scores. In addition to the individualized pro- 
file analyses employed in clinical assessment, the pattern of test scores obtained 
with a personnel selection: battery can be evaluated in terms of multiple cutoff 
scores. Briefly, this procedure involves the establishment of a minimum cutoff 
score on each test. When this method is applied strictly, every person who falls 
below the minimum score an any one of the appropriate tests is rejected, In 
choosing appropriate tests and establishing cutoff scores for a given occupation, it 
is customary to consider more than test validities. If only tests yielding significant 
validity coefficients were taken into account, one or more essential abilities in 
which all workers in the occupation excel might be overlooked—hence the need 
for considering also those aptitudes in which workers excel as a group, even when 
individual differences beyond a certain minimum are unrelated to degree of job 
success. In some jobs, moreover, workers may be so homogeneous in a key trait 
that the range of individual differences is too narrow to yield a significant 
correlation between test scores and criterion. 

This use of the multiple cutoff method is most fully illustrated with the Gen- 
eral Aptitude Test Battery (GATB)’ developed by the United States Employment 
Service for use in the occupational counseling and referral program of its state 
employment service offices (U.S. Department of Labor, 1970). Of the nine apti- 
tude scores yielded by this battery, those to be considered for each occupation 
were chosen on the basis of criterion correlations, as well as means and standard 
deviations of workers in that occupation and the qualitative observations of job 
analysts. 

The strongest argument for the use of multiple cutoffs rather than a regression 
equation is based on the possibility of compensatory scores. In other words, a seri- 
ous deficiency in one skill may remain undetected in the person’s total battery 
score because of a high score on another test. If the deficiency occurs in a skill 
that is critical for performance of the particular job, the chosen applicant will fail. 
This situation, however, can be avoided by identifying one or more critical skills 
that may be required in a particular job and applying a cutoff only on the corre- 
sponding tests. For the majority of tests, it is usually preferable to retain the actual 
score, because the higher the test score the better will be the individual's job per- 
formance. In most jobs, there is a linear relation between predictor score and cri- 
terion performance. It should be added that extensive research with the GATB 
has provided strong evidence for such linearity of relation (Coward & Sackett, 
1990; Hartigan & Wigdor, 1989; Hawk, 1970). Under these conditions, selecting 


"This widely used test battery is discussed further in chapter 17, in connection with the industrial and organiza- 
tional use of tests. 
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individuals on the basis of the actual magnitude of their test scores yields better 
job performance than does accepting all those who exceed minimum cutoff scores. 


USE OF TESTS FOR CLASSIFICATION DECISIONS 


The Nature of Classification. Psychological tests may be used for purposes 
of selection, placement, or classification. In selection, each individual is either 
accepted or rejected. Deciding whether or not to admit a student to college, to 
hire a job applicant, or to accept an army recruit for officer training are examples 
of selection decisions. When selection is done sequentially, the earlier stages are 
often called “screening,” the term “selection” being reserved for the more 
intensive final stages. “Screening” may also be used to designate any rapid, rough 
selection process even when not followed by further selection procedures. 

Both placement and classification differ from selection in that no one is re- 
jected, or eliminated from the program. All individuals are assigned to appropri- 
ate “treatments” so as to maximize the effectiveness of outcomes. In placement, 
the assignments are based on a single score. This score may be derived from a sin- 
gle test, such as a mathematics achievement test. If a battery of tests has been ad- 
ministered, a composite score computed from a single regression equation would 
be employed. Examples of placement decisions include the sectioning of college 
freshmen into different mathematics classes on the basis of their achievement test 
scores, assigning applicants to clerical jobs requiring different levels of skill and 
responsibility, and identifying psychiatric patients as "more disturbed" and “less 
disturbed" for purposes of therapy. It is evident that in each of these decisions 
only one criterion is employed and that placement is determined by the individ- 
ual's position along a single predictor scale. 

Classification, on the other hand, always involves two or more criteria. In a mil- 
itary situation, for example, classification is a major problem, since each individ- 
ual in an available personnel pool must be assigned to the military specialty where 
he or she can serve most effectively. Classification decisions are likewise required 
in industry, when new employees are assigned to training programs for different 
kinds of jobs. Other examples include the counseling of students regarding choice 
of college curriculum (science, liberal arts, etc.), as well as field of concentration. 
Counseling is based essentially on classification, since the client is told her or his 
chances of succeeding in different academic programs or occupations. Clinical di- 
agnosis is likewise a classification problem, the major purposes of each diagnosis 
being a decision regarding the most appropriate type of therapy. 

Although placement can be done with either one or more predictors, classifica- 
tion requires multiple predictors whose validity is individually determined against 
each criterion. A classification battery requires a different regression equation for 
each criterion. Some of the tests may have weights in all the equations, although 
of different values; others may be included in only one or two equations, having 
zero or negligible weights for some of the criteria. Thus, the combination of tests 
employed out of the total battery, as well as the specific. weights, differs with the 
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particular criterion. An early example of such a classification battery is that de- 
veloped by the U.S. Air Force for assignment of personnel to different training 
programs. This battery, consisting of both paper-and-pencil and apparatus tests, 
provided stanine scores for pilots, navigators, bombardiers, and a few other air- 
crew specialties. By finding an individual’s estimated criterion scores from the dif- 
ferent regression equations, it was possible to predict whether, for example, he or 
she was better qualified as a pilot than as a navigator. A far more extensive and 
later example is Project A, the U.S. Army Selection and Classification Project 
(J. P. Campbell, 1990b). 


Differential Validity. In the evaluation of a classification battery, the major 
consideration is its differential validity against the separate criteria. The object of 
such a battery is to predict the difference in each person's performance in two or 
more jobs, training programs, or other criterion situations. Tests chosen for such a 
battery should yield very different validity coefficients for the separate criteria. In 
a two-criterion classification problem, for example, the ideal test would have a 
high correlation with one criterion and a zero correlation (or preferably a 
negative correlation) with the other criterion. General intelligence tests are 
relatively poor for classification purposes, since they predict success about equally 
well in most areas. Hence, their correlations with the criteria to be differentiated 
would be too similar. An individual scoring high on such a test would be classified 
as successful for either assignment, and it would be impossible to predict in which 
he or she would be better. In a classification battery, we need some tests that are 
good predictors of criterion A and poor predictors of criterion B, and other tests 
that are poor predictors of A and good predictors of B. 

Statistical procedures have been developed for selecting tests so as to maximize 
the differential validity of a classification battery (Brogden, 1946a, 1951, 1954; 
Horst, 1954; Mollenkopf, 1950b; Zeidner & Johnson, 1991). When the number 
of criteria is greater than two, however, the problem becomes quite complex. In 
practice, various empirical approaches have been followed to approximate the de- 
sired goals. A thorough analysis of the complexities of the classification problem 
is provided by J. P. Campbell (1990a, pp- 715-721). 


Multiple Discriminant Functions. An alternative way of handling classifi- 
cation decisions is by means of the multiple discriminant function (French, 1966). 
Essentially, this is a mathematical procedure for determining how closely the 
individual's scores on a whole set of tests approximate the scores typical of persons 
in a given occupation, curriculum, psychiatric syndrome, or other category. A 
person would then be assigned to the particular group he or she resembles most 
closely. Although the regression equation permits the prediction of degree of 
success in each field, the multiple discriminant function treats all persons in one 
category as of equal status. Group membership is the only criterion data utilized by 
this method. The discriminant function is useful when criterion scores are 
unavailable and only group membership can be ascertained. Some tests, for in- 
stance, are validated by administering them to persons in different occupations, 
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although no measure of degree of job success is available for individuals within 
each field. 

The discriminant function is also appropriate when there is a nonlinear rela- 
tion between the criterion and one or more ptedictors. For example, in certain 
personality traits there may be an optimum range fora given occupation. Individ- 
uals having either more or less of the trait in question would thus be at a disad- 
vantage. It seems reasonable to expect, for instance, that salespersons showing a 
moderately high amount of social dominance would be most likely to succeed, 
and that the chances of success would decline as scores move in either direction 
from this region. With the discriminant functions, we would tend to select indi- 
viduals falling within this optimum range. With the regression equation, on the 
other hand, the more dominant the score, the more favorable would be the pre- 
dicted outcome. If the correlation between predictor and criterion were negative, 
of course, the regression equation would yield more favorable predictions for the 
low scorers. But there is no direct way whereby an intermediate score would re- 
ceive maximum credit. Although in many instances the two techniques would 
lead to the same choices, there are situations in which persons would be differ- 
ently classified by regression equations and discriminant functions. For most 
psychological testing purposes, regression equations provide a more effective 
technique. Under certain circumstances, however, the disctiminant function is 
better suited to yield the required information. 


chapter, the effectiveness of any test in selecting personnel for a given job 
depends on the selection ratio. In classification decisions, we work with a smaller 
selection ratio and are thus able to assign better-qualified persons to each job. If 


for both jobs, the selection ratio would be 20%, since we could do no better than 
take the top 20 applicants, 

Even when predictors for the two jobs are highly correlated, so that some of the 
same appl will qualify for both jobs, there is considerable gain from the use of 


Validity: Measurement and Interpretation 177 


Mean Standard Criterion Score of Persons Placed on Two Jobs 
by Selection or Classification Strategies 


Classification: 
Selection Ratio Two predictors whose intercorrelation is 


for Each Job 


(Adapted from Brogden, 1951; p. 182.) $^ 


T 


mately 1/3 of a standard deviation above the chance value). As the correlation be- 
tween the predictors increases, the job effectiveness of the selected employees de- 
creases; but it remains better than chance even when the correlation is .80. With 
lower selection ratios, we can of course obtain better-qualified personnel. As can be 
seen in Table 6-3 however, for each selection ratio, mean job performance is better 
when applicants are chosen through classification than through selection strategies. 

A practical illustration of the advantages of classification strategies is pro- 
vided by the use of Aptitude Areas scores in the assignment of personnel to mil- 
itary occupational specialties in the U.S. Army (Maier & Fuchs, 1973). In this 
research, each Aptitude Area corresponded to a group of Army jobs requiring a 
similar pattern of aptitudes, knowledge, and interests. From a 13-test classifica- 
tion battery, combinations of three to five tests were used to find the individual's 
score in each Aptitude Area. Figure 6-4 shows the results of an investigation of 
7,500 applicants for enlistment in which the use of Aptitude Area scores was 
compared with the use of a global screening test, the Armed Forces Qualification 
Test (AFQT). It will be noted that only 5696 of this group reached or exceeded 
the 50th percentile on the AFQT, while 8096 reached or exceeded the average 
standard score of 100 on their best Aptitude Area. Thus, when individuals are 
allocated to specific jobs on the basis of the aptitudes required for each job, a 
very large majority are able to perform as well as the average of the entire sample 
or better. This apparent impossibility, in which nearly everyone could be above 
average, can be attained by capitalizing on the fact that nearly everyone excels 
in some aptitude. 

The same point was illustrated with a quite different population in a study of 
gifted children by Feldman and Bratton (1972). For demonstration purposes, 49 
children in two fifth-grade classes were evaluated on each of 19 measures, all of 
which had previously been used to select students for special programs for the 
gifted. Among these measures were global scores on a group intelligence test and 
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Figure 6-4. Percentages Scoring Above Average on AFQT and on Best Aptitude 
Area of Army Classification Battery in a Sample of 7,500 Applicants for Enlistment. 


(Data from U.S. Army Research Institute for the Behavioral and Social Sciences. Courtesy of J. E. Uhlaner.) 


on an educational achievement battery, tests of separate aptitudes and separate 
academic areas such as reading and arithmetic, a test of creative thinking, grades 
in music and art, and teachers’ nominations of the most "gifted" and the most 
"creative" children in each class. When the five highest-ranking children selected 
by each criterion were identified, they included 9296 of the group. Thus, it was 
again shown that nearly all members of a group will excel when multivariate cri- 
teria are employed. 


STATISTICAL ANALYSES OF TEST BIAS 


The Problem. If we want to use tests to predict outcomes in some future 
situation, such as an applicant's performance in college or on a job, we need tests 
with high predictive validity against the particular criterion. This requirement is 
commonly overlooked in the development of so-called culture-fair tests 
(discussed further in chaps. 9 and 12). In the effort to include in such tests only 
functions common to different cultures or subcultures, we may choose content 
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that has little relevance to any criterion we wish to predict. A better solution is to 
choose criterion-relevant content and then investigate possible population 
differences in the effectiveness of the test for its intended purpose. Validity coeffi- 
cients, regression weights, and cutoff scores may vary as a function of differences 
in the test takers’ experiential backgrounds. These values should therefore be 
checked within subgroups for which there is reason to expect such effects. These 
possible subgroup differences will be recognized as a special case of the role of 
moderator variables discussed in a preceding section. And it will be recalled that 
the search for significant and stable moderator effects has proved disappointing. 
In the present section, we shall examine specific applications of this type of 
analysis to minority populations in the United States. 

It should be noted that the predictive characteristics of test scores are less 
likely to vary among cultural groups when the test is intrinsically relevant to cri- 
terion performance. If a verbal test is employed to predict nonverbal job perfor- 
mance, a fortuitous validity may be found in one cultural group because of 
traditional associations of past experiences within that culture. In a group with a 
different experiential background, however, the validity of the test may disappear. 
On the other hand, a test that directly samples criterion behavior, or one that 
measures essential prerequisite skills, is likely to retain its validity in different 
groups. 

Since the mid-1960s, there has been a rapid accumulation of research on 
possible ethnic differences in the predictive meaning of test scores.® The large 
majority of studies conducted thus far have dealt with African-Americans, al- 
though a few have included other ethnic minorities. The problems investigated 
are generally subsumed under the heading of test bias. In this context, the term 
“bias” is employed in its well-established statistical sense, to designate constant 
or systematic error as opposed to chance error. This is the same sense in which 
we speak of a biased sample, in contrast to a random sample. The principal 
questions that have been raised regarding test bias pertain to validity coeffi- 
cients (slope bias) and to the relationship between group means on the test and 
on the criterion (intercept bias). These questions will be examined in the next 


two sections. 


Slope Bias. To facilitate an understanding of the technical aspects of test bias, 
let us begin with a scatter diagram, or bivariate distribution, such as those illus- 
trated in chapter 4 (especially Fig. 4-3). For the present purpose, the horizontal 
axis (X) represents scores on a test and the vertical axis (Y) represents criterion 
scores, such as college grade-point average or an index of job performance. It will 
be recalled that the tally marks, showing the position of each individual on both 
test and criterion, indicate the direction and general magnitude of the correlation 
between the two variables. The line of best fit drawn through these tally marks is 


ed from this voluminous literature. For an overview of the problem 


SOnly a few representative studies can be cite X 
and an analysis of its many facets, we recommend N. S. Cole and Moss (1989), Hunter, Schmidt, and Rauschen- 


berger (1977), and C: R. Reynolds and Brown (1984). 
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known as the regression line, and its equation is the regression equation. In this 
example, the regression equation would have only one predictor. The multiple 
regression equations discussed earlier in this chapter have several predictors, but 
the principle is the same. 

‘When both test and criterion scores are expressed as standard scores (SD = 
1.00), the slope of the regression line equals the correlation coefficient. For this 
reason, if a test yields a significantly different validity coefficient in the two 
groups, this difference is described as slope bias. This type of group difference is 
often designated as "differential validity." Several investigators have also em- 
ployed the term "single-group validity" to refer to a test whose validity coefficient 
reached statistical significance in one group but failed to do so in another. 

Figure 6-5 provides schematic illustrations of regression lines for several bivari- 
ate distributions.” The ellipses represent the region within which the tally marks 
of each sample would fall. Case 1 shows the bivariate distribution of two groups 
with different means in the predictor, but with identical regression lines between 
predictor and criterion. In this case, there is no test bias, since any given test score 
(X) corresponds to the identical criterion score (Y) in both groups. Case 2 illus- 
trates slope bias, with a lower validity coefficient in the minority group. 

In differential validity studies, a common difficulty arises from the fact that the 
number of cases in the minority sample is often much smaller than in the majority 
sample. Under these conditions, the same validity coefficient could be statisti- 
cally significant in the majority sample and not significant in the minority sample 
(so-called single-group validity). With 100 cases, for example, a correlation of .27 
is clearly significant at the .01 level; with 30 cases, the same correlation falls far 
short of the minimum value required for significance even at the .05 level. For 
this reason, the proper procedure to follow in such differential validation studies 
is to evaluate the difference between the two validity coefficients, rather than to 
test each separately for significance (Humphreys, 1973). By the latter procedure, 
one could easily “demonstrate” that a test is valid for, let us say, Whites and not 
valid for Blacks. All that would be needed for this purpose is a large enough group 
of Whites and a small enough group of Blacks! 

A sophisticated statistical analysis of the results of 19 published studies report- 
ing validity coefficients for Black and White employment samples casts serious 
doubt on the conclusions reached in some of the earlier studies (Schmidt, Berner, 
& Hunter, 1973). Taking into account the obtained validities and the size of sam- 
ples in each study, the investigators demonstrated that the discrepancies in valid- 
ity coefficients found between Blacks and Whites did not differ from chance 
expectancy. This conclusion was corroborated in a later, more comprehensive 
analysis covering 39 studies (Hunter, Schmidt, & Hunter, 1979). The topic of dif- 
ferential validity for minority and nonminority job applicants elicited continuing 


*?The type of analysis of test bias illustrated in Figure 6-5 has come to be known as the “Cleary model" because it 
was used by Cleary (1968) in a widely cited study of College Board Scholastic Aptitude Test scores of minority 
students. The mathematical procedures were developed by Gulliksen and Wilks (1950), and their application to 
ethnic and sex comparisons was suggested by Humphreys (1952), The diagrams in Figure 6-5 were adapted from 
a study by M. A. Gordon (1953) conducted in the U.S. Air Force under Humphreys’ direction. 
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known as the regression line, and its equation is the regression equation. In this 
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regression equations discussed earlier in this chapter have several predictors, but 
the principle is the same. 
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reason, if a test yields a significantly different validity coefficient in the two 
groups, this difference is described as slope bias. This type of group difference is 
often designated as “differential validity.” Several investigators have also em- 
ployed the term "single-group validity" to refer to a test whose validity coefficient 
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of each sample would fall. Case 1 shows the bivariate distribution of two groups 
with different means in the predictor, but with identical regression lines between 
predictor and criterion. In this case, there is no test bias, since any given test score 
(X) corresponds to the identical criterion score (Y) in both groups. Case 2 illus- 
trates slope bias, with a lower validity coefficient in the minority group. 

In differential validity studies, a common difficulty arises from the fact that the 
number of cases in the minority sample is often much smaller than in the majority 
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this reason, the proper procedure to follow in such differential validation studies 
is to evaluate the difference between the two validity coefficients, rather than to 
test each separately for significance (Humphreys, 1973). By the latter procedure, 
one could easily “demonstrate” that a test is valid for, let us say, Whites and not 
valid for Blacks. All that would be needed for this purpose is a large enough group 
of Whites and a small enough group of Blacks! 

A sophisticated statistical analysis of the results of 19 published studies report- 
ing validity coefficients for Black and White employment samples casts serious 
doubt on the conclusions reached in some of the earlier studies (Schmidt, Berner, 
& Hunter, 1973). Taking into account the obtained validities and the size of sam- 
ples in each study, the investigators demonstrated that the discrepancies in valid- 
ity coefficients found between Blacks and Whites did not differ from chance 
expectancy. This conclusion was corroborated in a later, more comprehensive 
analysis covering 39 studies (Hunter, Schmidt, & Hunter, 1979). The topic of dif- 
ferential validity for minority and nonminority job applicants elicited continuing 
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discussion over more than a decade. Some investigators observed that, because of 
methodological limitations, the results were merely inconclusive. It is noteworthy, 
however, that no evidence of differential validity was found in well-designed, 
large-scale studies on industrial samples (J. T. "Campbell, Crooks, Mahoney, & 
Rock, 1973) and Army personnel (Maier & Fuchs, 1973). In general, the method- 
ologically sounder studies proved to be those less likely to find differential validity. 

Similar results have been obtained in numerous investigations of Black and 
White college students (Breland, 1979). Validity coefficients of the College 
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Board Scholastic Aptitude Test and other college admission tests for Black stu- 
dents were generally as high as those obtained for White students, or higher. 
These relationships were found when the Black and White samples were attend- 
ing the same colleges, as well as when they were attending separate colleges. 
Working at a very different level, B. C. Mitchell (1967) studied the validities of 
two educational readiness tests against end-of-the-year achievement test scores of 
first-grade schoolchildren. In the large samples of Black and White children 
tested, validities of total scores and of subtests were very similar for the two ethnic 
groups, although tending to run somewhat higher for the Blacks. In summary, 
comprehensive surveys and critical analyses of available studies have failed to 
support the hypothesis that ability tests are less valid for Blacks than for Whites 
in predicting occupational or educational performance (Hunter, Schmidt, & 
Rauschenberger, 1984; Linn, 1978). 

Although comparable studies with other minorities are much fewer, similar re- 
sults have been obtained with Hispanic Americans for both educational and em- 
ployment testing (Breland, 1979; Duran, 1983, 1989; Pennock-Román, 1990; 
Schmidt, Pearlman, & Hunter, 1980). For Hispanic Americans, however, the in- 
terpretation of test scores is complicated by varying degrees of bilingualism and by 
attitudinal and cultural variables in the home that affect not only test perfor- 
mance but also academic and occupational achievement; under these conditions, 
we would not expect predictive validity to be affected. In a well-designed survey 
of published research on the use of tests for college admission, Duran (1983) ob- 
served that altering the tests does not offer a promising solution for these prob- 
lems among Hispanic-American students; rather, the problems need to be 
investigated and resolved directly. Nevertheless, test scores should be interpreted 
with full knowledge of the moderating background variables in individual cases. 
Furthermore, any generalization regarding Hispanic Americans should take into 
account possible differences among subgroups, such as Puerto Ricans, Mexican 
Americans, and others. 


Intercept Bias. Even when a test yields the same validity coefficients for two 
groups, it may still show intercept bias. The intercept of a regression line refers to 
the point at which the line intersects the axis. A test exhibits intercept bias if it 
systematically underpredicts or overpredicts criterion performance for a particular 
group. Let us look again at Case 1 in Figure 6—5, in which majority and minority 
samples show identical regressions. Under these conditions, there is neither slope 
nor intercept bias. Although the groups differ significantly in mean test scores, they 
show a corresponding difference in criterion performance. In Case 3, on the other 
hand, the two groups have regression lines with the same slope but different 
intercepts. In this case, the minority group (A) has a higher intercept than the 
majority group (B); that is, the minority regression line intersects the Y axis at a 
higher point than does the majority regression line. Although the validity 
coefficients computed within each group are equal, any test score (X) will 
correspond to different criterion scores in the two groups, as shown by points Y and 
Yp: The same test score thus has a different predictive meaning for the two groups. 
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Psychologists who are concerned about the possible unfairness of tests for mi- 
nority group members visualize a situation illustrated in Case 3. Note that, in this 
case, the majority excels on the test, but majority and minority perform equally 
well on the criterion. Selecting all applicants in terms of a test cutoff established 
for the majority group: would. thus discriminate unfairly against the minority. 
Under these conditions, use of the majority regression line for both groups under- 
predicts the criterion performance of minority group members. This situation is 
likely to occur when a large proportion of test variance is irrelevant to criterion 
performance and measures functions in which the majority excels the minority. A 
thorough job analysis and satisfactory test validity provide safeguards against the 
choice of such a test. 

The problem of intercept bias relates most closely to what has popularly been 
designated "test fairness." Although the terms "test fairness" and "test bias" are 
sometimes used broadly and interchangeably to cover all aspects of test use with 
cultural minorities, it has become customary to identify test fairness (or unfair- 
ness) with intercept bias. This usage was followed in the Uniform Guidelines on 
Employee Selection Procedures (1978). In the section on Faimess (14B), the defin- 
ing statement reads: 


When members of one race; sex, or ethnic group characteristically obtain lower 
scores on a selection procedure than members of another group, and the differences 
in scores are not reflected in differences in a measure of job performance, use of the 
selection procedure may unfairly deny opportunities to members of the group that 
obtains the lower scores. 


Empirical investigations of actual test use, however, have shown either no sig- 
nificant intercept bias or, more-often, a slight tendency in the opposite direction, 
as illustrated in Case 4, Figure 6-5. Here the majority group (B) has a higher in- 
tercept than the minority group (A). Under these conditions, use of the majority 
regression line and cutoff score for both groups overpredicts the criterion perfor- 
mance of minority group members and hence discriminates in favor of the minor- 
ity group. Such results have been obtained in the prediction of college grades 
(Breland, 1979; Duran, 1983; Zeidner, 1987), law school grades (Linn, 1975), 
performance in Army and Air Force training programs (M. A. Gordon, 1953; 
Maier & Fuchs, 1973; C. W. Shore & Marion, 1972), anda wide variety of indus- 
trial criteria (see Hunter et al, 1984, for survey). : 

It has been demonstrated mathematically that Case 4 (Fig. 6-5) occurs if two 
groups differ in one or more additional variables that correlate positively with both 
test and criterion (Linn & Werts; 1971; Reilly, 1973). The small overprediction is 
a statistical artifact of considering one predictor at a time. As more predictors are 
added to the battery, the overprediction decreases, a fact that has been empirically 
verified with several populations, ranging from law school students and office 
clerks to kindergarten children (see Hunter et al., 1984). ' 

It is interesting to note that the same results have been obtained when 
comparisons were made between groups classified according to educational or 
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socioeconomic level. The Army Classification Battery tended to overpredict the 
performance of high school dropouts and underpredict the performance of college 
graduates in training programs for military occupational specialties (Maier, 1972). 
Similarly, the college grades of students whose fathers were in the professions 
were underpredicted from various academic aptitude tests, while the grades of stu- 
dents from lower occupational levels tended to be overpredicted (Hewer, 1965). 
In all these studies, comparisons of higher-scoring and lower-scoring groups re- 
vealed either no significant difference in intercepts or a slight bias in favor of the 
groups scoring lower on the tests. 


Decision Models for Fair Use of Tests. Gradually, the focus of research 
began to shift from the evaluation of test bias to the design of selection strategies 
for fair test use with cultural minorities. If a selection strategy follows the 
regression (see Cleary) model illustrated in Figure 6—5, individuals will be chosen 
(for college admission, employment, etc.) solely on the basis of their predicted 
criterion scores. This strategy will maximize overall criterion performance, 
without regard to other goals of the selection process. According to this strategy, a 
fair use of tests in selection is one that is based only on the best estimate of 
criterion performance for each individual. 

Several other decision models were proposed that have the effect of selecting 
larger proportions of persons from the lower-scoring group. This effect is in line 
with the goal that is generally designated by such terms as “affirmative action” or 
the reduction of “adverse impact” in the selection process. When first introduced, 
these alternative models appeared to follow procedures quite different from those 
of the regression model.!? It was later demonstrated, however, that they could all 
be expressed as variants of one comprehensive model ( Darlington, 1971; Gross & 
Su, 1975; Petersen & Novick, 1976). The differences among the various models 
can be explained in terms of the value judgments implicit in each model. The 
role of values in decision strategies was illustrated earlier in this chapter (see Fig. 
6-2). It will be recalled that the assignment of a relative utility to each outcome 
requires a judgment of the degree of favorableness or unfavorableness of that out- 
come. These value judgments, together with the probability of each outcome, are 
used in computing the total expected utility (EU) of the decision strategy. 

The decision-theoretic analyses of fair test use demonstrated that the proposed 
models differ in their definition of fairness, insofar as they implicitly assign differ- 
ent values to acceptance and rejection of potential successes and failures within 
minority and nonminority populations. The expected utility model makes the 
under-lying social values explicit. This approach calls for an overt statement of 
utilities, which cannot be reached through statistical means but requires open dis- 
cussion and successive approximations to balance conflicting goals (N.S. Cole & 


nette and Borman (1979, pp. 491—500), Gross and Su (1975, pp. 350-351), and C. R. Reynolds (1982). Fuller 
explanations can be found in Hünter and Schmidt (1976) and Hunter et al. (1977). 
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Moss, 1989; Darlington, 1976; Messick, 1989). Among the goals to be reconciled 
are those of providing equality of opportunity for all individuals, maximizing 
success rate and productivity, increasing the demographic mix and representa- 
tiveness of the work force (at least in certain occupations), and extending prefer- 
ential treatment to groups disadvantaged by past inequities. 

Finally, it should be emphasized that statistical adjustments in test scores, cut- 
offs, and prediction formulas hold little promise as a means of correcting social 
inequities. The use of statistical manipulations that conceal score differences 
through separate subgroup- or race-norming! is likely to harm individuals by as- 
signing them to jobs or educational programs for which they lack the necessary 
prerequisite skills or knowledge. The result is likely to be poor performance, 
which will not only affect the individual adversely in self-concept and attitude 
toward work but may also contribute to the perpetuation of a stereotype that per- 
sons from some cultural or ethnic category are poor performers. More construc- 
tive solutions are suggested by other approaches earlier in this chapter. One is 
illustrated by multiple-aptitude testing and classification strategies that permit 
the fullest utilization of the diverse aptitude patterns fostered by different cul- 
tural backgrounds. A broader consideration of relevant personality traits, moti- 
vation, and attitudes also contributes to the prediction of job or educational 
performance. Another approach is through adaptive treatments, such as individ- 
ualized training programs. In order to maximize the fit of such programs to indi- 
vidual characteristics, it is essential that tests reveal as fully and accurately as 
possible the person’s present level of development in the requisite abilities. 
Comprehensive decision models provide a framework for combining a diversity 
of approaches and value systems and assessing the overall effectiveness of each 
solution. 


See, e.g., D. C. Brown (1994), L. S. Gottfredson (1994), and Sackett and Wilk (1994). 
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amiliarity with the basic concepts and techniques of item analysis, like 
knowledge about other phases of test construction, can help test users in 
their evaluation of published tests. In addition, item analysis is particularly 
relevant to the construction of informal, local tests, such as the quizzes and 
examinations prepared by teachers for classroom use. Some of the general guide- 
lines for effective item writing, as well as the simpler statistical techniques of item 
analysis, can materially improve classroom tests and are worth using even with 
small groups. 

Items can be analyzed qualitatively, in terms of their content and form, and 
quantitatively, in terms of their statistical properties. Qualitative analysis includes 
the consideration of content validity, discussed in chapter 5, as well as the evalu- 
ation of items in terms of effective item-writing procedures. Quantitative analysis 
includes principally the measurement of item difficulty and item discrimination. 
Both the validity and the reliability of any test depend ultimately on the charac- 
teristics of its items. High reliability and validity can be built into a test in ad- 
vance through item analysis. Tests can be improved through the selection, 
substitution, or revision of items. 

Item analysis makes it possible to shorten a test and at the same time to in- 
crease its validity and reliability. Other things being equal, a longer test is more 
valid and reliable than a shorter one. The effect of lengthening or shortening a 
test on the reliability coefficient was discussed in chapter 4, where the Spearman- 
Brown formula for estimating this effect was also introduced. These estimated 
changes in reliability occur when the discarded items are equivalent to those that 
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remain, or when equivalent new items are added to the test. Similar changes in 
validity will result from the deletion or addition of items of equivalent validity. 
All such estimates of change in reliability or validity refers to the lengthening or 
shortening of tests through a random selection of items, without item analysis. 
When a test is shortened by eliminating the least satisfactory items, however, the 
short test may be more valid and reliable than the original longer instrument. 


ITEM DIFFICULTY 


Percentage Passing. For most testing purposes, the difficulty of an item is 
defined in terms of the percentage (or proportion) of persons who answer it 
correctly. The easier the item, the larger this percentage will be. A word that is 
correctly defined by 70% of the standardization sample (p = .70) is regarded as 
easier than one that is correctly defined by only 15% (p — .15). It is customary to 
arrange items in order of difficulty, so that test takers begin with relatively easy 
items and proceed to items of increasing difficulty. This arrangement gives the 
test takers confidence in approaching the test and also reduces the likelihood of 
their wasting much time on items beyond their ability to the neglect of easier 
items they can correctly complete. 

In the process of test construction, a major reason for measuring item difficulty 
is to choose items of suitable difficulty level. Most standardized ability tests are de- 
signed to assess as accurately as possible each individual’s level of attainment in 
the particular ability. For this purpose, if no one passes an item, it is excess bag- 
gage in the test. The same is true of items that everyone passes. Neither of these 
types of items provides any information about individual differences. Since such 
items do not affect the variability of test scores, they contribute nothing to the re- 
liability or validity of the test. The closer the difficulty of an item approaches 1.00 
or O, the less differential information about test takers it contributes. Conversely, 
the closer the difficulty level approaches .50, the more differentiations the item 
can make. Suppose out of 100 persons, 50 pass an item and 50 fail it (p = .50). 
This item enables us to differentiate between each of those who passed it and 
each of those who failed it. We thus have 50 x 50 or 2,500 paired comparisons, or 
bits of differential information. An item passed by 70% of the persons provides 70 
x 30 or 2,100 bits of information; one passed by 90% provides 90 x 10 or 900; one 
passed by 100% provides 100 x 0 or 0. The same relationships would hold for 
harder items, passed by fewer than 50%. 

For maximum differentiation, then, it would seem that one should choose all 
items at the .50 test difficulty level. The decision is complicated, however, by the 
fact that items within a test tend to be intercorrelated. The more homogeneous the 
test, the higher these intercorrelations will be. In an extreme case, if all items were 
perfectly intercorrelated and all were of .50 difficulty level, the same 50 persons out 
of 100 would pass each item. Consequently, half of the test takers would obtain 
perfect scores and the other half zero scores. Because of item intercorrelations, it is 
best to select items with a moderate spread of difficulty level, but whose average 
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difficulty is .50. Moreover, the higher the item intercorrelations (or the correlations 
of items with total score), the wider should be the spread of item difficulty. 

Another consideration in the choice of appropriate item difficulty pertains to 
the probability of guessing in multiple-choice items. To allow for the fact that a 
certain proportion of test takers will select the correct option by guessing, the de- 
sired proportion of correct responses is set higher than would be the case for a free 
response item. For a five-option multiple-choice item, for example, the average 
proportion correct should be approximately .69 (Lord, 1952). 


Interval Scales. The percentage of persons passing an item expresses item 
difficulty in terms of an ordinal scale; that is, it correctly indicates the rank order 
or relative difficulty of items. For example, if Items 1, 2, and 3 are passed by 3096, 
2096, and 1096 of the cases, respectively, we can conclude that Item 1 is the 
easiest and Item 3 is the hardest of the three. But we cannot infer that the 
difference in difficulty between Items 1 and 2 is equal to that between Items 2 and 
3. Equal percentage differences would correspond to equal differences in difficulty 
only in a rectangular distribution, in which the cases were uniformly distributed 
throughout the range. This problem is similar to that encountered in connection 
with percentile scores, which are also based on percentages of cases. It will be 
recalled from chapter 3 that percentile scores do not represent equal units, but 
differ in magnitude from the center to the extremes of the distribution (Fig. 3-4). 
If we assume a normal distribution of the trait measured by any given item, the 
difficulty level of the item can be expressed in terms of an equal-unit interval 
scale by reference to a table of normal curve frequencies. In chapter 3 we saw, for 
example, that approximately 3496 of the cases in a normal distribution fall be- 
tween the mean and a distance of 10 in either direction (Fig. 3-3). With this in- 
formation, we can examine Figure 7-1, which shows the difficulty level of an item 
passed by 84% of the cases. Since it is the persons in the upper part of the distri- 
bution who pass and those in the lower part who fail, this 8496 includes the upper 
half (5096) plus 3496 of the cases from the lower half (50 + 34 = 84). Hence, the 
item falls 16 below the mean, as shown in Figure 7-1. An item passed by 1696 of 
the cases would fall 16 above the mean, since above this point there are 1696 of 
the cases (50 — 34 = 16). An item passed by exactly 5096 of the cases falls at the 
mean and would thus have a 0 value on this scale. The more difficult items have 
plus values, the easier items minus values. The difficulty value corresponding to 
any percentage passing can be found by reference to a normal curve frequency 
table, given in any standard statistics text and in the Study Guide for this book. 


Thurstone Absolute Scaling. Indices of item difficulty expressed as percent- 
ages or normal curve units are limited to the ability range covered by the sample 
from which they were obtained. For several purposes, however, there is need for a 
measure of item difficulty applicable across different samples varying in ability 
level. In educational achievement tests, for example, it is advantageous to be able 
to compare a child's score over several successive grades on a uniform scale. Yet it 
would obviously be impracticable to scale the items appropriate for all grades by 
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Item 1 Mean ) 


Figure 7-1, Relation between Percentage of Persons Passing an Item and the Item 
Difficulty in Normal Curve Units. 


administering them to a single group, since some items would be too difficult and 
some too easy for nearly everyone in the group. 

Another example is provided by large-scale testing programs that require many 
equivalent forms to be administered at different times, such as college admission 
programs. This problem was considered in chapter 3 as it affects the interpreta- 
tion of total scores on such instruments as the Scholastic Assessment Test. The 
solution described in that case was to use a fixed, standard reference group to de- 
fine the scale units and origin and then to convert all subsequent scores to that 
scale. This conversion requires a set of anchor or linkage items that are included 
in the tests administered to any pair of groups. The items constitute a minitest, in 
that they are representative of the whole test in content and form. A different set 
of linkage items can be shared by different pairs of groups. Each new form is 
linked to one or two earlier forms, which in turn are linked to other forms 
through a chain of minitests extending back to the original reference group. 

The same general method can be used to measure the difficulty of individual 
items on a uniform scale that is applicable to any number of interlocking groups. 
The statistical procedure, known as absolute scaling, was developed by Thurstone 
(1925, 1947a) and has been widely employed in test development (e.g., Donlon, 
1984). Essentially, this procedure involves two steps. First, we find scale values of 
items separately within each group, by converting the percentage passing each 
item into normal curve O-distances, or z values. Second, we translate all these 
scale values into corresponding values for one of the groups, chosen as a standard 
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or reference group. Any group can be chosen as a reference group, such as the first 
group tested, the youngest group, a group in the middle of the range, or some 
other convenient group. What is required is a set of common, anchor items ad- 
ministered to two or more groups and scaled within each group. 

The scale values of the same items in two (or more) groups serve to define the 
relation between the groups and permit the transmutation of all item difficulty 
values from one group to another. This relation is illustrated schematically in Fig- 
ure 7-2, showing the 6-distance (z) of the same item (i) in two adjacent groups A 
and B. The same item (i) is passed by a larger proportion of persons in Group B 
than in Group A. Its distance from the mean is accordingly smaller in Group B 
(zB) than in Group A (za). The corresponding Group A and Group B values for 
all the common items provide the basis for the conversion formula whereby all 
items given to Group B can be converted to Group A difficulty values or vice 
versa. A simple approximation can be obtained by plotting the Group A z val- 
use against the Group B z values and drawing a straight line through the points. 
This line can then be used to find z4 values for all other items administered to 
Group B. 

The same conversion procedure can be extended to any number of groups by 
working with pairs of adjacent overlapping groups. For example, in a test designed 
for Grades 1 through 8, the eighth-grade scale values can be transmuted into the 
seventh-grade scale, the seventh-grade scale into the sixth, and so on down to the 
first grade. Adjacent grade groups are usually sufficiently similar to be able to 
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Figure 7-2. Normal Curve z Values Showing Relative Difficulty of Same Item in 
Groups A and B. 
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share a large segment of the test for linkage purposes. Any one grade, however, 
would share different segments with the next higher and the next lower grades. 


Distribution of Test Scores. The difficulty of the test as a whole is, of 
course, directly dependent on the difficulty of the items that make up the test. A 
comprehensive check of the difficulty of the total test for the population for 
which it is designed is provided by the distribution of total scores. If the 
standardization sample is a representative cross-section of such a population, then 
it is generally expected that the scores will fall roughly into a normal distribution 
curve. 

Let us suppose, however, that the obtained distribution curve is not normal but 
clearly skewed, as illustrated in Figure 7-3, Parts A and B. The first of these distri- 
butions, with a piling of scores at the low end, suggests that the test has too high a 
floor for the group under consideration, lacking a sufficient number of easy items 
to discriminate properly at the lower end of the range. The result is that persons 
who would normally scatter over a considerable range obtain zero or near-zero 
scores on this test. A peak at the low end of the scale is therefore obtained. This 
artificial piling of scores is illustrated schematically in Figure 7-4, in which a nor- 
mally distributed group yields a skewed distribution on a particular test. The op- 
posite skewness is illustrated in Part B of Figure 7-3, with the scores piled at the 
upper end, a finding that suggests insufficient test ceiling. Administering a test de- 
signed for the general population to selected samples of college or graduate stu- 
dents will usually yield such a skewed distribution, a number of students obtaining 
nearly perfect scores. With such a test, it is impossible to measure individual dif- 
ferences among the more able students in the group. If more difficult items had 
been included in the test, some individuals would undoubtedly have scored 
higher than the present test permits. 

When the standardization sample yields a markedly nonnormal distribution 
on a test, the difficulty level of the test is ordinarily modified until a normal 
curve is approximated. Depending on the type of deviation from normality that 


A. Piling at Lower End of the Scale PR B. Piling at Upper End of the Scale 


Figure 7-3. Skewed Distribution Curves. 
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Figure 7-4. Skewness Resulting from Insufficient Test Floor. 


appears, easier or more difficult items may be added, other items eliminated or 
modified, the position of items in the scale altered, or the scoring weights as- 
signed to certain responses revised. Such adjustments are continued until the 
distribution becomes at least roughly normal. Under these conditions, the most 
likely score, obtained by the largest number of persons, usually corresponds to 
about 5096 correct items. To one who is unfamiliar with the methods of psycho- 
logical test construction, a 5096 score may seem shockingly low. It is sometimes 
objected, on this basis, that the examiner has set too low a standard of passing 
on the test. Or the inference is drawn that the group tested is a particularly poor 
one. Both conclusions, of course, are totally meaningless when viewed in the 
light of the procedures followed in developing psychological tests. Such tests 
are deliberately constructed and specifically modified so as to yield a mean score 
of approximately 5096 correct. Only in this way can the maximum differentia- 
tion between individuals at all ability levels be obtained with the test. With 
a mean of approximately 5096 correct items, there is the maximum opportunity 
for a normal distribution, with individual scores spreading widely at both 
extremes.! 


‘Actually, the normal curve provides finer discrimination at the ends than at the middle of the scale. Equal dis- 
crimination at all points of the scale would require a rectangular distribution. The normal curve, however, has 
an advantage if subsequent statistical analyses of scores are to be conducted, because many current statistical 
techniques assume approximate normality of distribution. For this and other reasons, it is likely that most tests 
designed for general use will continue to follow a normal-curve pattern for some time to come. 
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Relating Item Difficulty to Testing Purpose. Standardized psychological 
tests have generally been designed to elicit maximum differentiation among 
individuals at all levels. Our discussion of item difficulty thus far has been 
directed to this type of test. In the construction of tests to serve special purposes, 
however, the choice of appropriate item difficulties, as well as the optimal form of 
the distribution of test scores, depends upon the type of discrimination sought. 
Accordingly, tests designed for screening purposes should utilize items whose 
difficulty values come closest to the desired selection ratio. For example, to select 
the upper 20% of the cases, the best items are those clustering around a p of .20 
(or somewhat higher to allow for guessing). Since in a screening test no 
differentiation is required within the accepted or rejected groups, the most 
effective use of testing time is obtained when items cluster near the critical cutoff. 
It follows, for instance, that if a test is to be used to screen scholarship applicants 
from a college population, the items should be considerably more difficult than 
the average for that population. Similarly, if slow learners are being selected for a 
remedial training program, items that are much easier than average would be 
desirable. 

Another example of the choice of item difficulty levels in terms of special test- 
ing goals is to be found in mastery testing. It will be recalled (chap. 3) that mas- 
tery testing is often associated with domain-referenced testing. If the purpose of 
the test is to ascertain whether an individual has adequately mastered the basic 
essentials of a skill or whether he or she has acquired the prerequisite knowledge 
to advance to the next step in a learning program, then the items should probably 
be at the .80 or .90 p level. Under these conditions, we would expect the majority 
of those taking the examination to complete nearly all items correctly. Thus, the 
very easy items (even those passed by 100% of the cases), which are discarded as 
nondiscriminative in the usual standardized test, are the very items that would be 
included in a mastery test. Similarly, a pretest, administered prior to a learning 
unit to determine whether any of the students have already acquired the skills to 
be taught, will yield very low percentages of passing for each item. In this case, 
items with very low or even zero p values should not be discarded, since they re- 
veal what remains to be learned. 

It is apparent from these examples that the appropriate difficulty level of items 
depends upon the purpose of the test. Although in most testing situations items 
clustering around a medium difficulty (50) yield the maximum information about 
each individual's performance level, decisions about item difficulty cannot be 
made routinely, without knowing how the test scores will be used. 


ITEM DISCRIMINATION 


Choice of Criterion. ltem discrimination refers to the degree to which an 
item differentiates correctly among test takers in the behavior that the test is 
designed to measure. When the test as a whole is to be evaluated by means of 
criterion-related validation, the items may themselves be evaluated and selected 
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on the basis of their relationship to the same external criterion. This procedure 
has been followed especially in the development of certain personality and 
interest tests, discussed in chapters 13 and 14. It is also the method generally 
followed in choosing items for inclusion in biographical inventories, which 
typically cover a heterogeneous collection of background facts about the indi- 
vidual. In these types of instruments, there is no a priori basis for grading a re- 
sponse as right or wrong, or assigning it a scoring weight, except by comparison 
with the criterion status of persons who give that response. From an initial pool of 
items, those items are retained that best differentiate among persons classified in 
different criterion categories, such as various occupations or psychiatric syn- 
dromes. Frequently, the criterion groups consist of successes and failures in an 
academic course, a training program, or a type of job. 

In the domain-referenced mastery testing described in chapter 3, items can be 
evaluated by comparing the item performance of individuals who have had vary- 
ing amounts of instruction in the relevant functions (Panell & Laabs, 1979; L. A. 
Shepard, 1984). Usually the comparison is between the proportion of persons 
who give the correct item response in a preinstruction and a postinstruction 
group. Because these tests are used to determine whether individuals have 
reached a specified level of mastery, individual differences in overall performance 
on a single occasion are minimized. Hence, internal item analyses will not be 
meaningful, and an external criterion, such as amount of relevant instruction, is 
needed. 

In other types of achievement tests, as in many aptitude tests, item discrimina- 
tion is usually investigated against total score on the test itself." For educational 
achievement tests, an external criterion is not typically available. For aptitude 
tests, the increasing emphasis on construct validation makes total score an appro- 
priate criterion for item selection. In the initial stages of test development, the 
total score provides a first approximation to a measure of the ability, trait, or con- 
struct under investigation. 

Let us examine further the implications of choosing items on the basis of an 
external criterion and on the basis of total test score. The former tends to maxi- 
mize the validity of the test against the external criterion, the latter to maximize 
the internal consistency or homogeneity of the test. Under certain conditions, 
the two approaches may lead to opposite results, the items chosen on the basis of 
external validity being the very ones rejected on the basis of internal consistency. 
Let us suppose that the preliminary form of a scholastic aptitude test consists of 
100 arithmetic items and 50 vocabulary items. In order to select items from this 
initial pool by the method of internal consistency, some index of agreement be- 
tween performance on each item and total score on the 150 items will be com- 
puted. It is apparent that such an index would tend to be higher for the arithmetic 
than for the vocabulary items, since the total score is based on twice as many 


?[tem-test correlations will be somewhat inflated by the common specific and error variance in the iter: and the 
test set it is a part. Formulas are available to correct for this part-whole effect (Guilford & Fruchter, 1978, 
PP- ke . 
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arithmetic items. If it is desired to retain the 75 “best” items in the final form of 
the test, it is likely that most of these items will prove to be arithmetic problems. 
In terms of the external criterion of scholastic achievement, however, the vocab- 
ulary items might have been more valid predictors than the arithmetic items. If 
such is the case, the item analysis will have served to lower rather than raise the 
validity of the test. 

The practice of rejecting items that have low correlations with total score pro- 
vides a means of purifying or homogenizing the test. By such a procedure, the 
items with the highest average intercorrelations will be retained. This method of 
selecting items will increase test validity only when the original pool of items 
measures a single trait and when this trait is present in the criterion or construct 
to be assessed. Some types of tests, however, measure a combination of traits re- 
quired by a complex criterion. Purifying the test in such a case may reduce its cri- 
terion coverage and thus lower validity. 

The selection of items to maximize criterion-related test validity may be 
likened to the selection Of tests that will yield the highest validity for a battery. It 
will be recalled (chap. 6) that the test contributing most toward battery validity is 
one having the highest correlation with the criterion and the lowest correlation 
with the other tests in the battery. If this principle is applied to the selection of 
items, it means that the most satisfactory items are those with the highest exter- 
nal validities and the lowest coefficients of internal consistency. Thus, an item 
that has a high correlation with the external criterion but a relatively low correla- 
tion with total score would be preferred to one correlating highly with both crite- 
rion and test score, since the first item presumably measures an aspect of the 
criterion not adequately covered by the rest of the test. 

It might seem that items could be selected by the same methods used in choos- 
ing tests for inclusion in a battery. Thus, each item could be correlated with the 
external criterion and with every other item. The best items chosen by this 
method could then be weighted by means of a regression equation. Such a proce- 
dure, however, is neither feasible nor theoretically defensible. Not only would the 
computation labor be excessive, but interitem correlations are also subject to 
wide sampling fluctuation, and the resulting regression weights would be too un- 
stable to provide a satisfactory basis for item selection. An even more serious ob- 
jection, however, is that the resulting test would be so heterogeneous in content 
as to preclude meaningful interpretation of the test score. 

External validation and internal consistency are both desirable objectives of 
test construction. The relative emphasis to be placed on each varies with the na- 
ture and purpose of the test. For many testing purposes, a satisfactory compromise 
is to sort the relatively homogeneous items into separate tests or subtests, each of 
which covers a different aspect of the external criterion. Thus, breadth of cover- 
age is achieved through a variety of tests, each yielding a relatively unambiguous 
score, rather than through heterogeneity of items within a single test. By such a 
procedure, items with low indices of internal consistency would not be discarded 
but would be segregated: As a result, fairly high internal consistency would be at- 


tained within each subtest or item group. 
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Statistical Indices of Item Discrimination. Since item responses are 
generally recorded as right or wrong, the measurement of item discrimination 
usually involves a dichotomous variable (the item) and a continuous variable 
(the criterion). In certain situations, the criterion, too, may be dichotomous, as 
in graduation versus nongraduation from college, or success versus failure on a 
job. Moreover, a continuous criterion may be dichotomized for purposes of 
analysis. 

Over fifty different indices of item discrimination have been developed and 
used in test construction. One difference among them pertains to their applicabil- 
ity to dichotomous or continuous measures. Among those applicable to dichoto- 
mous variables, moreover, some assume a continuous and normal distribution of 
the underlying trait on which the dichotomy has been artificially imposed; others 
assume a true dichotomy. Another difference concerns the relation of item diffi- 
culty to discrimination. Certain indices measure item discrimination indepen- 
dently of item difficulty. Others yield higher discrimination values for items close 
to the .50 difficulty level than for those at the extremes of difficulty. 

Despite differences in procedure and assumptions, most item discrimination 
indices provide closely similar results (Oosterhof, 1976). Although the numerical 
values of the indices may differ, the items that are retained and those that are re- 
jected on the basis of different discrimination indices are largely the same. In fact, 
the variation in item discrimination data from sample to sample is generally 
greater than that among different methods. 


Use of Extreme Groups. A common practice in item analysis is to compare 
the proportion of cases that pass an item in contrasting criterion groups. When 
the criterion is measured along a continuous scale, as in the case of course grades, 
job ratings, output records, or total scores on the test, upper (U) and lower (L) 
criterion groups are selected from the extremes of the distribution. Obviously, the 
more extreme the groups the sharper will be the differentiation. But the use of 
very extreme groups, such as upper and lower 1096, would reduce the reliability of 
the results because of the small number of cases utilized. In a normal distribution, 
the optimum point at which these two conditions balance is reached with the 
upper and lower 2796 (T. L. Kelley, 1939). When the distribution is flatter than 
the normal curve, the optimum percentage is slightly greater than 27 and 
approaches 33 (Cureton, 1957b). With small groups, as in an ordinary classroom, 
the sampling error of item statistics is so large that only rough results can be 
obtained. Under these conditions, therefore, we need not be too concerned about 
the exact percentage of cases in the two contrasted groups. Any convenient 
number between 25% and 33% will serve satisfactorily. 

With the large and normally distributed samples employed in the development 
of standardized tests, it has been customary to work with the upper and lower 27% 
of the criterion distribution. Many of the tables and abacs prepared to facilitate 
the computation of item discrimination indices are based on the assumption that 
the “27% rule" has been followed. As high-speed computers become more gener- 
ally available, it is likely that the various labor-saving procedures developed to 
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facilitate item analysis will be graduall replaced by more exact and sophisticated 
methods. With computer facilities, it is better to analyze the results of the entire 
sample, rather than working with upper and lower extremes. — 


Simple Analysis with Small Groups. Because item analysis is frequently 
conducted with small groups, such as the students who have taken a classroom 
quiz, we shall consider first a simple procedure especially suitable for this 
situation. Let us suppose that in a class of 60 students we have chosen the 20 
students (3396) with the highest and the 20 with the lowest test scores. We now 
have three groups of papers which we may call the Upper (U), Middle (M), and 
Lower (L) groups. First, we need to tally the correct responses to each item given 
by students in the three groups. This can be done most readily if we list the item 
numbers in one column and prepare three other columns headed U, M, and L. As 
we come to each student's paper, we simply place a tally next to each item he or 
she answered correctly. This is done for each of the 20 papers in the U group, then 
for each of the 20 in the M group, and finally for each of the 20 in the L group. 
We are now ready to count up the tallies and record totals for each group as 
shown in Table 7-1. For illustrative purposes, the first seven items have been 
entered. A rough index of the discriminative value of each item can be found by 
subtracting the number of persons answering it correctly in the L group from the 
number answering it correctly in the U group. These U — L differences are given 
in the last column of Table 7-1. A measure of item difficulty can be obtained with 
the same data by adding the number passing each item in all three criterion 
groups (U +M +L). m» 1 ME, : £ 
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Examination of Table 7-1 reveals four questionable items that have been iden- 
tified for further consideration or for class discussion. Two items, 2 and 7, have 
been singled out because one seems to be too easy, having been passed by 56 out 
of 60 students, and the other too difficult, having been passed by only 5. Items 4 
and 5, while satisfactory with regard to difficulty level, show a negative and a zero 
discriminative value, respectively. We would also consider in this category any 
items with a very small positive U — L difference, of roughly three or less when 
groups of approximately this size are being compared. With larger groups, we 
would expect larger differences to occur by chance in a nondiscriminating item. 

The purpose of item analysis in a teacher-made test is to identify deficiencies 
either in the test or in the teaching. Discussing questionable items with the class 
is often sufficient to diagnose the problem. If the wording of the item was at fault, 
the itei.i can be revised or discarded in subsequent testing. Discussion may show, 
however, that the item was satisfactory, but the point being tested had not been 
properly understood. In that case, the topic may be reviewed and clarified. In nar- 
rowing down the source of the difficulty, it is often helpful to carry out a supple- 
mentary analysis, as shown in Table 7—2, with at least some of the items chosen 
for discussion. This tabulation gives the number of students in the U and L groups 
who chose each option in answering the particular items. 

Although Item 2 has been included in Table 7-2, there is little more we can 
learn about it by tabulating the frequency of each wrong option, since only 4 per- 
sons in the L group and none in the U group chose wrong answers. Discussion of 
the item with the students, however, may help to determine whether the item as a 
whole was too easy and therefore of little intrinsic value, whether some defect in 


Response Analysis of Individual Items 


Response Options 
Item Group 

1 2 3 4 5 

o 0 -0 20 0 

2 0 1 16 1 
2:04 10 el 0 1 
2. 16 2 0 0 

2- Been T3 11 2 

e duce FIS 11 2 
Gr IE 225 4 3 

S o | 5 ee 3 4 


Note: Correct options are in bold type. 
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its construction served to give away the right answer, or whether it is a good item 
dealing with a point that happened to have been effectively taught and well re- 
membered. In the first case, the item would probably be discarded, in the second 
it would be revised, and in the third it would be retained unchanged. 

The data on Item 4 suggest that the third option had some unsuspected impli- 
cations that led 9 of the better students to prefer it to the correct alternative. The 
point could easily be settled by asking those students to explain why they chose it. 
In Item 5, the fault seems to lie in the wording either of the stem or of the correct 
alternative, because the students who missed the item were uniformly distributed 
over the four wrong options. Item 7 is an unusually difficult one that was an- 
swered incorrectly by 15 of the U and all of the L group. The slight clustering of 
responses on incorrect option 3 suggests a superficial attractiveness of this option, 
especially for the more easily misled L group. Similarly, the lack of choices of the 
correct response (option 1) by any of the L group suggests that this alternative 
was so worded that superficially, or to the uninformed, it seemed wrong. Both of 
these features, of course, are desiderata of good test items. Class discussion might 
show that Item 7 is a good item dealing with a point that few class members had 
actually learned. 


The Index of Discrimination. If the numbers of persons passing each item in 
U and L criterion groups are expressed as percentages, the difference between these 
two percentages provides an index of item discrimination that can be interpreted 
independently of the size of the particular sample in which it was obtained. This 
index has been repeatedly described in the psychometric literature (see, e.g., Ebel; 
1979; A. P. Johnson, 1951; Oosterhof, 1976) and has been variously designed as 
U — L, ULI, ULD, or simply D. Despite its simplicity, it has been shown to agree 
quite. closely with other, more elaborate measures of item discrimination 
(Engelhart, 1965; Oosterhof, 1976). The computation of D can be illustrated by 
reference to the data previously reported in Table 7-1. First, the numbers of persons 
passing each item in the U and L groups are changed to percentages. The difference 
between the two percentages is the Index of Discrimination (D), shown in Table 
7-3. This index can have any value between +100 and —100. If all members of the 
U group and none of the L group pass an item, D equals 100. Conversely, if all 
members of the L group and none of the U group pass it, D equals —100. If the 
percentages of both groups passing an item are equal, D will be zero. ; 
As is true of several indices of item discrimination, the values of D are not in- 
dependent of item difficulty but are biased in favor of intermediate difficulty lev- 
els. Table 7-4 shows the maximum possible value of D for items with different 
percentages of correct responses. If either 100% or 0% of the total sample pass an 
item, there can be no difference in percentage passing in U and L groups; hence D 
is zero. At the other extreme, if 5096 pass an item, it would be possible for all the 
U cases and none of the L cases to pass it, thus yielding a D of 100 (100 — 0 = 
100). If 70% pass, the maximum value that D could take can be illustrated as 
follows: (U) 50/50 = 10096; (L) 20/50 = 4096; D = 100 — 40 = 60. It will be re- 
called that, for most testing purposes, items closer to the 5096 difficulty level are 
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Computation of Index of Discrimination 


Percentage Passing 


Index of 
E ee Zeg Discrimination (D) 
1 75 35 40 
100- i 80 ` 20 
TENES 45 50 
SS 80 -30 
I 3587 55 0 
m 5 45 35 
Ze l 0 25 


Note: Data from Table 7-1. 


preferable. Hence, item discrimination indices that favor this difficulty level are 
often appropriate for item selection. 


Phi Coefficient. Many indices of item discrimination report the relationship 
between item and criterion in the form of a correlation coefficient. One of these is 
the phi coefficient (ø). Computed from a fourfold table, ø is based on the propor- 
tions of cases passing and failing an item in U and L criterion groups. Like all cor- 
relation coefficients, it yields values between +1.00 and —1.00. The ø coefficient 
assumes a genuine dichotomy in both item response and criterion variable. Con- 
sequently, it is strictly applicable only to the dichotomous conditions under 
which it was obtained and cannot be generalized to any underlying relationship 
between the traits measured by item and criterion. Like the D index, ø is biased 
toward the middle difficulty levels—that is, it yields the highest possible 
correlations for dichotomies closest to a 50-50 split. 

The significance level of a ø coefficient can be teadily computed through the 
relation of o to both chi square and the normal curve ratio. Applying the latter, 


Relation of Maximum Value of D to Item Difficulty 


Percentage Passing Item Maximum Value of D 
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we can ideritify the minimum value of ø that would reach statistical significance 
at the .05 or .01 levels with the following formulas: 


NU. VN 
= 2.58 

Da, 

PVN 


In these formulas, N represents the total number of cases in both criterion groups 
combined. Thus, if there were 50 cases in U and 50 in L groups, N would be 100 
and the minimum ¢ significant at the .05 level would be 1.96 + V100 = .196. Any 
item whose ø reached or exceeded .196 would thus correlate with the criterion at 
the .05 level of significance. 


Biserial Correlation. Asa final example of a commonly used measure of item 
discrimination, we may consider the biserial correlation (ry;;), which differs from 
ø in two major respects. First, This assumes a continuous and normal distribution 
of the traits underlying both the dichotomous item response and the criterion 
variable. Second, it yields a measure of item-criterion relationship that is inde- 
pendent of item difficulty. 

Computation of the biserial correlation utilizes the mean criterion scores of 
those who pass and those who fail the item, the SD of the entire criterion group, 
and the proportion of cases of those passing and those failing the item. Formu- 
las for computing ry, are given in most statistics textbooks (e.g., Guilford & 
Fruchter, 1978, pp. 304—306). The standard error of fhis can be computed with a 
simple formula using terms already present in the ny, formula itself. It should be 
added that, when available, computers provide a direct way of obtaining both bi- 
serial correlations and their standard errors. 


ITEM RESPONSE THEORY 


Item-Test Regression. Both item difficulty and item discrimination can: be 
represented simultaneously in item-test regression graphs. For purposes of illus- 
tration, let us consider a hypothetical 12-item test calling for short answers of 
the free response type, such as the vocabulary tests in individually administered 
intelligence scales. Table 7—5 gives the proportion of persons at each total score 
level who responded correctly to each of two items. These data have been plotted 
in Figure 7-5. ra 

The difficulty level of each item can be defined as its 5096 threshold, as is cus- 
tomary in establishing sensory thresholds in psychophysics. This has been done 
in Figure 7-5 by dropping perpendiculars from the points where the two item 
curves cross the 5096 line to the horizontal axis, where the corresponding total 
scores are located. It is thereby shown that persons with total Scores of approx- 
imately 8 have a 50-50 chance of passing Item 7, as do those with scores of 
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Hypothetical Data Illustrating Item-Test Regression 


Proportion Correct 
Total Score Do See 
Item 7 Item 13 
1.00 95 
82 62 
pace 3! 
H 46 
33 00. 
10 00. 
.06 00 
.03 .00 
00 - .00 
35490 1r 00. 
i 00 i 00 


approximately 10 in the case of Item 13. The discriminative power of each item 
is indicated by the steepness of the curve: the steeper the curve, the higher the 
correlation of item performance with total score and the higher the discrimina- 
tive index. By inspection, the discriminative power of Items 7 and 13 appears 
roughly similar. 

An examination of item-score regressions, as illustrated in Figure 7-5, enables 
us to visualize how effectively an item functions. Such graphs not only combine 
information on item difficulty and item discrimination but also provide a com- 
plete picture of the relation between item performance and total score. For in- 
stance, Item 7 shows a reversal, insofar as the proportion of persons with scores of 
10 who passed the item is greater than the proportion of those with scores of 11 
who did so. If these results were based on a small sample, this reversal would prob- 
ably be negligible; but it illustrates the kind of information that such an analysis 
of item data can bring to light. 

For purposes of mathematical treatment and of precise evaluation and selec- 
tion of items, however, such a graph is crude and quite limited. This approach 
has served as the starting point for the development of highly sophisticated 
and complicated types of item analysis, which began to attract attention in the 
1970s and early 1980s. A reason for their growing popularity is undoubtedly to 
be found in the rapidly expanding availability of high-speed computers, with- 
out which the computational tasks required by such analyses would be prohib- 
itive. With the preparation of computer programs for several proposed models 

"of item analysis, the practical application of these refined procedures became 
feasible. Major features of this approach will be described in the following 
sections. 


Proportion of Correct Answers 


Figure 7-5. ltem-Test Regression for Items 7 and 11. 
(Data from Table 7-5.) yu 


Item Response Theory (IRT): Basic Features." "The mathematical ` ` 
approach under consideration, item. response theory, has also been designated ` ` 
latent trait theory and item characteristic «curve (ICC) theory. A fundamental ` 
feature of this approach is that item performance is related to the estimated — 
amount of the respondent's “latent trait,” symbolized by the Greek letter 0 
(theta). As used in this context, a “latent trait” refers to a statistical construct; 
there is no implication that it is a psychological or physiological entity with an 
independent existence. In cognitive tests, the latent trait is generally called the 


3For a clear overview of IRT methodology and its applications, see Hambleton et al. (1991). More technical sur- 
veys and critical evaluations can be found in Hambleton (1989) and Drasgow and Hulin (1990). For the intro- 
duction of IRT into psychometrics, see Lord (1980), D. J. Weiss (1983), and D. J. Weiss and Davidson (1981). 
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ability measured by the test. Total score on the test is often taken as an initial 
estimate of that ability. 

Item characteristic curves are plotted from mathematically derived functions, 
rather than from the empirical data used in item-test regression curves. Different 
IRT models utilize different mathematical functions, based on diverse sets of 
assumptions. Some models use normal ogive functions (i.e., cumulative normal 
distributions); others use logistic functions, which take advantage of some 
mathematically convenient properties of logarithmic relations. In general, the re- 
sults obtained with the various models are substantially similar, provided their 
assumptions are met in the particular situations. Figure 7-6 shows item 
characteristic curves for three hypothetical items. The horizontal axis gives the 
ability scale (6), estimated from total test score and other information available 
about test responses in a particular sample. The vertical axis gives P; (8), the prob- 
ability of a correct response to Item i as a function of the person's position on the 
ability scale (6). This probability is derived from data on the proportion of persons 
at different ability levels who passed the item. 

In the complete, three-parameter model, each ICC is described by three para- 
meters derived mathematically from the empirical data. The item discrimination 
parameter (a;) indicates the slope of the curve. It is inversely related to the dis- 
tance one must travel along the ability continuum (6) in order to increase P, (6). 
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The higher the value of a;, the steeper the slope. In Figure 7-6, Items 1 and 2 
have the same ai, or discriminative value; Item 3 has a lower aj, since its curve 
rises more slowly. The item difficulty parameter (b;) corresponds to the location 
on the ability axis at which the probability of a correct response, P; (6), is .50. It 
will be seen that Items 2 and 3 have the same b;, while Item 1 is easier, requiring 
less ability for a .50 probability of correct response. IRT models designed to deal 
with multiple-choice items often include a third parameter, the so-called guessing 
parameter (c;).* It represents the probability that a correct response occurs by 
chance. In a multiple-choice item, even examinees at very low ability levels have 
a higher than zero probability of giving a correct response. This is illustrated by 
Item 3 in Figure 7-6, whose lower asymptote is considerably above zero. 

Estimates of both item parameters and ability are typically computed by itera- 
tive or successive-approximation procedures; the approximations are repeated 
until the values stabilize. In addition to yielding mathematically refined indices of 
item difficulty and item discrimination, IRT techniques provide several other 
benefits. An important feature of this approach is its treatment of reliability and 
error of measurement through item information functions. These functions, which 
are computed for each item, provide a sound basis for choosing items in test con- 
struction. The item information function takes all item parameters into account 
and shows the measurement efficiency of the item at different ability levels. 

The most widely publicized contribution of IRT models pertains to the sample- 
free nature of their results, technically described as invariance of item parameters. It 
is a basic concept of IRT theory that item parameters should be invariant when 
computed in groups differing in ability. This means that a uniform scale of mea- 
surement can be provided for use in different groups. It also means that groups as 
well as individuals can be tested with a different set of items, appropriate to their 
ability levels, and their scores will be directly comparable. Each person’s test score 
is based not only on the number but also on the preestablished difficulty levels, of 
items answered correctly. 

When many different samples are to be tested, one procedure is to work with a 
large item pool or item bank that has been precalibrated on a large random sam- 
ple. When the ability range is very wide, as in achievement test series overlapping 
many school grades, it is necessary to use common items (variously called anchor, 
linkage, or calibration items), in order to bridge the gaps across groups. Once the 
items in the pool are calibrated, any subset of items can be administered to any 
group or individual, and the resulting scores will be comparable. 


Other IRT Models. In the preceding section, we examined a three-parameter 
model. Two-parameter models, which omit the chance-response parameter (cj), 
are appropriate when the effects of guessing on test performance can be regarded 
as negligible. A one-parameter model, based only on the difficulty (b;) of a set of 
items, was developed by Rasch (1966; see also Andersen, 1983) and has been 


be called simply a lower-asymptote or chance parameter for the ICC, be- 


4 p 7 
Some investigators recommend that c; whereas guessing is a function of ability. 


cause three-parameter models treat c; as independent of ability, 
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adapted and strongly championed by some investigators (e.g., Wright, 1977; 
Wright & Stone, 1979). This model is based on the assumption that both 
guessing and item differences in discrimination are negligible. In constructing 
tests, the proponents of the Rasch model frequently discard those items that do 
not meet these assumptions. It has been argued, too, that IRT models are quite 
"robust," in the statistical sense, meaning that various assumptions may be 
violated within limits, without distortion of results. This, of course, is a matter for 
empirical verification. 

The models considered thus far assume unidimensionality of the test; that is, it is 
assumed that item responses can be attributed to a single trait. In general, the uni- 
dimensionality assumption can be adequately met if test performance depends on 
a single predominant trait, even though other traits may affect performance in 
minor ways. More general models, applicable to multidimensional tests, have also 
been designed, but the computational procedures are more laborious. Still other 
variants have been developed to handle graded (rather than dichotomous or 
right-wrong) item responses (Samejima, 1969), or to analyze the different re- 
sponse options in multiple-choice items (Bock, 1972). 


Current Status. The relative merits of alternative models of IRT are still 
widely debated. There is need for much more checking of mathematically derived 
values, not only with artificial data and computer simulation but also with live 
data. The invariance of item parameters needs especially to be widely investi- 
gated in realistic situations. For example, the same items may involve a different 
mix of abilities when performed by persons with different experiential back- 
grounds or by the same person at different stages of learning. From another angle, 
an impressive number and variety of computer programs are already available for 
IRT data (see, e.g., Hambleton, 1989, pp. 171—172); but the programs are in a 
constant state of flux through reevaluation, revisions, and replacements.? 

While the theoretical and methodological diversities are being worked out, 
there is nevertheless an increasing use of IRT procedures in practical test develop- 
ment. The techniques of IRT are rapidly being incorporated in newly constructed 
tests and in revised editions of widely used test batteries developed by commercial 
test publishers. Examples include the California Achievement Tests and the 
Comprehensive Tests of Basic Skills, as well as the Differential Ability Scales de- 
scribed in chapter 8. IRT is particularly appropriate for certain newly emerging 
types of testing, such as computerized adaptive testing (CAT), discussed in chap- 
ter 10. In such testing, each test taker may respond to a different set of items, but 
all are scored on a uniform scale (Wainer et al., 1990). A major application of 
IRT is in the long-term project to develop a CAT version of the Armed Services 
Vocational Aptitude Battery (Wiskoff & Schratz, 1989). 


5An outstanding current example is the ASCAL program for two- and three-parameter logistic IRT calibration, 
distributed by Assessment Systems Corporation (address in Appendix B). Also relevant is the recent develop- 
ment of a generalized linear item-response theory (GLIRT), from which different IRT models can be derived 
and which can accommodate different item formats (| Mellenbergh, 1994). 
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ITEM ANALYSIS OF SPEEDED TESTS 


Whether or not speed is relevant to the function being measured, item indices 
computed from a speeded test may be misleading. Except for items that all or 
nearly all examinees have had time to attempt, the item indices found from a 
speed test will reflect the position of the item in the test rather than its intrinsic 
difficulty or discriminative power. Items that appear late in the test will be passed 
by a relatively small percentage of the total sample, because only a few persons 
have time to reach these items. Regardless of how easy the item may be, if it oc- 
curs late in a speeded test, it will appear difficult. Even if the item merely asked for 
one’s name, the percentage of persons who passed it might be very low if the item 
were placed toward the end of a speeded test. 

Similarly, item discrimination indices tend to be overestimated for those items 
that have not been reached by all test takers. Because the more proficient individ- 
uals tend to work faster, they are more likely to reach one of the later items in a 
speed test. Thus, regardless of the nature of the item itself, some correlation be- 
tween the item and the criterion would be obtained if the item occurred late in a 
speed test. 

To avoid some of these difficulties, we could limit the analysis of each item to 
those persons who have reached the item. This is not a completely satisfactory so- 
lution, however, unless the number of persons failing to reach the item is small. 
Such a procedure would involve the use of a rapidly shrinking number of cases 
and would thus render the results on the later items quite unreliable. Moreover, 
the persons on whom the later items are analyzed would probably constitute a se- 
lected sample and hence would not be comparable to the larger samples used for 
the earlier items. As has already been pointed out, the faster performers tend also 
to be the more proficient. The later items would thus be analyzed on a superior 
sample of individuals. One effect of such a selective factor would be to lower the 
apparent difficulty level of the later items, since the percentage passing would be 
greater in the selected superior group than in the entire sample. It will be noted 
that this is the opposite error from that introduced when the percentage passing is 
computed in terms of the entire sample. In that case, the apparent difficulty of 
items is spuriously raised. x 

The effect of the above procedure on indices of item discrimination is less ob- 
vious, but nonetheless real. It has been observed, for example, that some low- 
scoring test takers tend to hurry through the test, marking items almost at random 
in their effort to try all items within the time allowed. This tendency is much less 
common among high-scoring test takers. ‘Asa result, the sample on which a late- 
appearing item is analyzed is likely to consist of some very poor respondents, who 
will perform no better than chance on the item, and a larger number of very pro- 
ficient and fast respondents, who are likely to answer the item correctly. In such a 
group, the item-criterion correlation will probably be higher than it would be in a 
more representative sample. In the absence of such random respondents, on the 
other hand, the sample on which the later items are analyzed will cover a rela- 


tively narrow range of ability. Under these conditions, the discrimination indices 
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of the later items will tend to be lower than they would be if computed on the en- 
tire unselected sample. 

The anticipated effects of speed on indices of item difficulty and item discrimi- 
nation have been empirically verified, both when item statistics are computed 
with the entire sample (Wesman, 1949) and when they are computed with only 
those persons who attempt the item (Mollenkopf, 1950a). In the latter study, 
comparable groups of high school students were given two forms of a verbal test 
and two forms of a mathematics test. Each of the two forms contained the same 
items as the other, but items occurring early in one form were placed late in the 
other. Each form was administered with a short time limit (speed conditions) and 
with a very liberal time limit (power conditions). Various intercomparisons were 
thus possible between forms and timing conditions. The results clearly showed 
that the position of an item in the speed tests affected its indices of difficulty and 
discrimination. When the same item occurred later in a speeded test, it was 
passed by a greater percentage of those attempting it, and it yielded a higher item- 
criterion correlation. 

The difficulties encountered in the item analysis of speeded tests are funda- 
mentally similar to those discussed in chapter 4 in connection with the reliability 
of speeded tests. Various solutions, both empirical and statistical, have been de- 
veloped for meeting these difficulties. One empirical solution is to administer the 
test with a long time limit to the group on which item analysis is to be carried out. 
This solution is satisfactory provided that speed itself is not an important aspect 
of the ability to be measured by the test. Apart from the technical problems pre- 
sented by specific tests, however, it is well to keep in mind that item analysis data 
obtained with speeded tests are suspect and call for careful scrutiny. 


CROSS-VALIDATION 


Meaning of Cross-Validation. It is essential that test validity be computed 
on a different sample of persons from that on which the items were selected. This 
independent determination of the validity of the entire test is known as cross- 
validation. Any validity coefficient computed on the same sample that was used 
for item-selection purposes will capitalize on random sampling errors within that 
particular sample and will consequently be spuriously high. In fact, a high validity 
coefficient could result under such circumstances even when the test has no 
validity at all in predicting the particular criterion. 

Let us suppose that out of a sample of 100 medical students, the 30 with the 
highest and the 30 with the lowest medical school grades have been chosen to 
represent contrasted criterion groups. If, now, these two groups are compared in a 
number of traits actually irrelevant to success in medical school, certain chance 
differences will undoubtedly be found. Thus, there might be an excess of private- 
school graduates and of red-haired persons within the upper criterion group. If we 
were to assign each individual a score by crediting her or him with one point for 
private-school graduation and one point for red hair, the means of such scores 
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would undoubtedly be higher in the upper than in the lower criterion group. This 
is not evidence for the validity of the predictors, however, since such a validation 
process is based on a circular argument. The two predictors were chosen in the 
first place on the basis of the chance variations that characterized this particular 
sample. And the same chance differences are operating to produce the mean dif- 
ferences in total score. When tested in another sample, however, the chance dif- 
ferences in frequency of private-school graduation and red hair are likely to 
disappear or be reversed. Consequently, the validity of the scores will collapse. 


An Empirical Example. A classic demonstration of the need for cross- 
validation is provided by an early investigation conducted with the Rorschach 
inkblot test (Kurtz, 1948). In an attempt to determine whether the Rorschach 
could be of any help in selecting sales managers for life insurance agencies, this 
test was administered to 80 such managers. These managers had been carefully 
chosen from several hundred employed by eight life insurance companies, so as to 
represent an upper criterion group of 42 considered very satisfactory by their 
respective companies, and a lower criterion group of 38 considered unsatisfactory. 
The 80 test records were studied by a Rorschach expert, who selected a set of 32 
signs, or response characteristics, occurring more frequently in one criterion 
group than in the other. Signs found more often in the upper criterion group were 
scored +1 if present and 0 if absent; those more common in the lower group were 
scored —1 or 0. Since there were 16 signs of each type, total scores could range 
theoretically from —16 to +16. 

When the scoring key based on these 32 signs was reapplied to the original 
group of 80 persons, 79 of the 80 were correctly classified as being in the upper or 
lower group. The correlation between test score and criterion would thus have 
been close to 1.00. However, when the test was cross-validated on a second com- 
parable sample of 41 managers, 21 in the upper and 20 in the lower group, the va- 
lidity coefficient dropped to a negligible .02. It was thus apparent that the key 
developed in the first sample had no validity for selecting such personnel. 


An Example with Chance Data. That the use of a single sample for item 
selection and test validation can produce a completely spurious validity coeffi- 
cient under pure chance conditions was vividly demonstrated in a classic study by 
Cureton (1950). The criterion to be predicted was the grade-point average of 


of each item or response sign in that students pe 
procedure followed in generating these chance scores, Cureton facetiously named 


the test the “B-Projective Psychokinesis Test.” 
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An item analysis was then conducted, with each student's grade-point average 
as the criterion. On this basis, 24 "items" were selected out of the 85. Of these, 9 
occurred more frequently among the students with an average grade of B or better 
and received a weight of +1; 15 occurred more frequently among the students 
with an average grade below B and received a weight of —1. The sum of these item 
weights constituted the total score for each student. Despite the known chance 
derivation of these “test scores,” their correlation with the grade criterion in the 
original group of 29 students proved to be .82. Such a finding is similar to that ob- 
tained with the Rorschach scores in the previously cited study. In both instances, 
the apparent correspondence between test score and criterion resulted from the 
utilization of the same chance differences both in selecting items and in deter- 
mining validity of total test scores. 


Conditions Affecting Validity Shrinkage. The amount of shrinkage of a 
validity coefficient in cross-validation depends in part on the size of the original 
item pool and the proportion of items retained. When the number of original 
items is large and the proportion retained is small, there is more opportunity to 
capitalize on chance differences and thus obtain a spuriously high validity coeffi- 
cient. Another condition affecting amount of shrinkage in cross-validation is size 
of sample. Since spuriously high validity in the initial sample results from an ac- 
cumulation of sampling errors, smaller groups (which yield larger sampling errors) 
exhibit greater validity shrinkage. 

If items are chosen on the basis of previously formulated hypotheses, derived 
from psychological theory or from past experience with the criterion, validity 
shrinkage in cross-validation will be minimized. For example, if a particular hy- 
pothesis required that the answer "Yes" be more frequent among successful stu- 
dents, then the item would not be retained if a significantly larger number of "Yes" 
answers were given by the unsuccessful students. The opposite, blindly empirical 
approach would be illustrated by assembling a miscellaneous set of questions with 
little regard to their relevance to the criterion behavior, and then retaining all 
items yielding significant positive or negative correlations with the criterion. 
Under the latter circumstances, we would expect much more shrinkage than 
under the former. A well-designed demonstration of the differences in validity 
shrinkage that actually occur when items are chosen by rational or by empirical 
strategies is provided by T. W. Mitchell and Klimoski (1986). In summary, shrink- 
age of test validity in cross-validation will be greatest when samples are small, the 
initial item pool is large, the proportion of items retained is small, and items are 
assembled without previously formulated rationale. 


DIFFERENTIAL ITEM FUNCTIONING 


Statistical Procedures. As one aspect of the investigation of test bias for 
minority groups, the analysis of “item bias” has received increasing attention. 
Such analysis is concerned essentially with the relative difficulty of individual test 
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items for groups with dissimilar cultural or experiential backgrounds. In psycho- 
metric terminology, this area of item analysis is known as differential item 
functioning (DIF). It seeks to identify items for which equally able persons from 
different cultural groups have different probabilities of success. For this purpose, 
equal ability means equal with regard to the construct that the test is designed to 
assess, or the criterion behavior it is designed to predict. Many methods have 
been developed for identifying such differentially functioning items, including 
both statistical and judgmental procedures (Berk, 1982; Camilli & Shepard, 
1994; Hambleton & Rogers, 1989; P. W. Holland & Thayer, 1988; P. W. Holland 
& Wainer, 1993; Osterlind, 1983; C. R. Reynolds & Brown, 1984). 

A major problem is that demographic (or other experiential) group differences 
in item difficulty are closely associated with mean differences in performance 
level on the test as a whole. Consequently, items that have good discriminative 
value in terms of total score are likely to appear as “biased” items and be dis- 
carded. Several procedures have been employed to control for such differences in 
total score. With increasing access to computers, one of the most promising meth- 
ods is based on item response theory (IRT). These procedures are especially ap- 
propriate when large samples are available. As shown earlier in this chapter, the 
item characteristic curves (ICC) for each item indicate the probability of correct 
response in relation to the ability scale for the test (Fig. 7-6). By comparing the 
ICC for the same item in any two groups, we can identify those items with sub- 
stantial differential functioning relative to the overall test performance of the 
groups, which is expressed on a uniform scale. This comparison is illustrated by 
the two items in Figure 7-7. It can be seen that, for Item 1, the ICCs for Groups 
A and B are quite dissimilar, while for Item 2 they are closely similar. For each 
item, the area between the two ICCs may be used to identify the range of tested 
ability within which there is evidence of DIE Once DIF items are identified, by 
whatever procedure is employed, the next step is to inquire into the nature and 
source of the difference. The answer to this question determines the disposition of 
the item. For this purpose, judgmental procedures may be needed, possibly in 
combination with further statistical analysis. 


Judgmental Procedures. There is no single “best method” of item-bias analy- 
sis for all purposes. Because different methods provide somewhat different kinds 
of information, it is desirable to employ a combination of methods. The appropri- 
ate combination depends upon the anticipated uses of the test and upon the type 
of inferences to be drawn from the scores. Generally, the best combination in- 
cludes some statistical and some judgmental procedures. 
When properly applied, judgmental procedures can provide useful information 
not otherwise obtainable (Scheuneman, 1982; Tittle, 1982). Judgmental analysis 
is especially appropriate at the initial and final stages of test construction, to pre- 
cede and follow statistical analyses. It is commonly introduced early in the devel- 
opment of a test to screen out content that is potentially offensive or demeaning 
to minorities, or content that perpetuates cultural stereotypes with regard to oc- 
cupational or other societal roles. For this purpose, major test publishers regularly 
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Figure 7-7. Irt Chiltactlirisie Curves (ICC) for Two Items, Meada Large 
and Small Amounts of Differential Item Functioning (DIF). 
(Graphs adapted from Pashley, 1992, by permission.) « "tts 
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use preliminary item reviews by their own staff and by representative outside con- 
sultants (Berk, 1982, chap. 9). Such reviews also help to identify test content that 
may be culturally restricted and hence unfamiliar to particular populations of test 
takers. It should be noted, however, that judgmental reviews have not generally 
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proved successful in predicting the relative difficulty or discriminative power of 
items for different populations (Plake, 1980; Sandoval & Miille, 1980; Scheune- 
man, 1982). Statistical analyses of empirical results are needed for this purpose. 

On the other hand, not all deviant items identified by statistical procedures 
can be assumed to be biased. The results need to be interpreted through a second 
and different kind of judgmental review. At this stage, the items are examined for 
possible sources of deviancy. The statistical outliers do not usually reveal any 
common feature or any obvious reason for deviancy; each item needs to be con- 
sidered individually. A few outliers may simply reflect statistical artifacts arising 
from the particular procedure. In other cases, the deviant performance may result 
from any of a wide variety of conditions that have different implications for test 
interpretation. The proper evaluation of such deviant items calls for familiarity 
with both the content domain of the test and the experiential differences among 
test-taker populations. 

A possible reason for deviancy is that an item does not measure the same con- 
struct in different groups. For example, an analogies term may measure verbal rea- 
soning in one group but vocabulary knowledge in another, because it contains a 
key word unfamiliar to many members of a particular minority. Similarly, an arith- 
metic problem may measure mathematical ability in one group but ability to un- 
derstand complex verbal statements in another. In these two examples, the 
missing knowledge was not relevant to the construct measured by the test as a 
whole. Suppose, however, that mathematical items involving decimals are found 
to be relatively more difficult for members of a particular group. This difference is 
relevant to the construct of mathematical ability. Hence, such outliers are not, in 
this sense, biased items. 

When deviant items are identified statistically, the source of the deviancy may 
be in part clarified through further statistical procedures, such as an analysis of 
wrong options chosen in a multiple-choice item. These supplementary analyses, 
in combination with judgmental reviews, should provide the basis for appropriate 
action. The deviant item may be discarded, revised as a whole, or altered in any 
flawed portion; the test instructions may be expanded or clarified; or the original 
item may be retained, after reconsideration in the light of the test specifications. 
The item analysis may even suggest a reevaluation of the test specifications them- 
selves, The result could be either a change in the test specifications or a clarifica- 
tion regarding the particular inferences than can be drawn from the test scores. 


A Notorious Case of Misuse of DIF. Considerable publicity was given to 
what was probably the first case considered by a court of law that was based 
primarily on item analysis in the evaluation of “test bias.” This case became 
widely known as the “Golden Rule case,” because it related to the use of a 
licensing examination in the hiring of employees by the Golden Rule Insurance 
Company. The similarity between the name of the company and the common use 
of the same two words in a very different sense probably added to the general 
confusion. The final settlement of the case was based on a comparison of group 
differences in sheer percentage of persons passing an item, with no effort to 
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equate groups in any measure of the ability the test was designed to assess, nor any 
consideration of the validity of the items for the intended purpose of the test. The 
decision was clearly in complete violation of the concept of differential item 
functioning and would be likely to eliminate the very items that were the best 
predictors of job performance. 

Because of its wide misunderstanding and its possible precedence-setting influ- 
ence on both occupational and educational use of tests, the fallacy in the 
“Golden Rule" decision was examined from several angles in critical evaluations 
(e.g, Lim & Drasgow, 1990), including an official statement by the American 
Psychological Association. The decision was also the subject of a symposium at 
an APA annual convention, most of whose papers were subsequently published in 
a special journal issue of Educational and Psychological Measurement: Issues and 
Practices (Bond, 1987; Faggen, 1987; Linn & Drasgow, 1987). A consideration of 
this notorious case highlights the potential practical hazards in efforts to assess 
"test bias" through superficial and incomplete indicators. 


EXPLORATIONS IN ITEM DEVELOPMENT 


The rapid expansion of computer utilization in the 1980s and 1990s, in combina- 
tion with progress in cognitive psychology, stimulated extensive research on in- 
novative approaches to item construction. Traditionally, item writing has been 
more an art than a science. Even under the best conditions, item writers are given 
instructions that specify little more than item form and content coverage. It is 
still common practice to rely on empirical pretesting of items to assess their diffi- 
culty level and discriminative power. Is there any way to predict these item statis- 
tics, before pretesting, simply from an analysis of the physical or semantic 
properties of the stimuli? Better yet, can items be constructed so as to have the de- 
sired difficulty and discriminative values? Can systematic manipulation of stimu- 
lus characteristics predetermine the cognitive demands of test items? These are 
the questions that are being investigated in ongoing research, through both ex- 
perimental and mathematical procedures (Bejar, 1985, 1991; Carroll, 1987; Em- 
bretson, 1985a, 1985b, 1991, 1994, 1995; Freedle, 1990). 

The cognitive demands of test stimuli can be explored through the techniques 
of task decomposition developed within cognitive psychology. By these proce- 
dures, the relationships of different item features to speed and error of perfor- 
mance can be investigated. Several such studies have been conducted with spatial 
items (Embretson, 1994; Pellegrino, Mumaw, & Shute, 1985). For example, the 
stimuli presented in a spatial analogies test can be classified with respect to: 
(1) complexity, or number of separate elements that must be identified (e.g. 
shape, size, position); and (2) transformations, or number of ways the stimulus is 
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altered within the pair to be evaluated. In certain types of spatial visualization 
problems, which require the test taker to choose the parts that can be assembled 
to form a given whole, the parts may be merely separated, or displaced, or rotated, 
or altered in a combination of these ways. 

Other studies have been concerned with the semantic characteristics of verbal 
stimuli. For example, in verbal reasoning tests, items can be constructed according 
to known logical principles (Colberg, 1985; Colberg, Nester, & Trattner, 1985; 
Scheuneman, Geritz, & Embretson, 1991; K. Sheehan & Mislevy, 1989; Shye, 
1988). Such a procedure could ensure that only one of the response options is truly 
correct and that different logical relations are represented in a predetermined pro- 
portion in the item sample. This procedure would also make possible the ma- 
nipulation of the logical complexity of the item, whose relation to difficulty level 
can then be empirically investigated. Some researchers have experimented with 
the construction of letter series designed to test inductive reasoning (Butterfield et 
al., 1985). A detailed set of rules was first developed for the systematic construc- 
tion of such letter series. Hypotheses were then formulated about what people do 
in trying to understand a series. The hypotheses were tested through empirical 
studies of the difficulty of series completion items. 

Embretson (1994) presents a thoroughgoing analysis and updating of the 
process of item development. This process begins with a definition of the con- 
structs to be assessed and proceeds to the design of a cognitive model for the test. 
The detailed features of this cognitive model provide the specifications for item 
writing. Empirical validation of items follows to ascertain how well the items ac- 
tually fit the cognitive model in its practical applications. The complete proce- 
dure is illustrated in the development of the Spatial Learning Ability Test, which 
measures not only initial spatial ability but also its modifiability following stan- 
dardized instruction. 

Research on the prediction of item difficulty from the physical and semantic 
properties of the stimuli not only facilitates the production of effective tests by 
item writers but may also lead to the construction of items by computers. It is cer- 
tainly possible to incorporate detailed item specifications in computer programs 
(see, e.g., Butterfield et al., 1985; Embretson, 1994). Undoubtedly, the potential 
advantages of these evolving test construction procedures are impressive. We 
must, however, guard against expecting too much from any one approach. It is 
quite likely, for example, that a test may measure some clearly identified cognitive 
constructs fully and effectively and yet not have high predictive validity for cer- 
tain important practical uses. For this reason it is essential to consider both as- 
pects of construct validation, which Embretson (1983) designates as construct 
representation and nomothetic span. Task decomposition provides information 
on construct representation; nomothetic span requires the investigation of the re- 
lations of test scores to a network of other, external variables, including criterion 
measures. A second caution against overgeneralization pertains to the need for 
relevant content knowledge in order to perform effectively in any subject-matter 
area or field of expertise. Processes are often linked to content; they cannot be ef- 


fectively evaluated in the absence of the appropriate content. 
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In conclusion, the innovative procedures cited in this section, when properly 
applied, can contribute significantly to the systematic and controlled construc- 
tion of test items. Moreover, by identifying the constructs measured by a test, 
these procedures can greatly enhance our understanding of the reasons why par- 
ticular tests predict performance in criterion situations. A related benefit pertains 
to the diagnostic use of tests, insofar as the source of the individual's strengths and 
weaknesses can be linked to particular cognitive processes. These are worthy 
goals, but their practical implementation still requires considerable research on 
remaining unsolved problems (see, e.g., Wainer, 1993a). Much research is now in 
progress on the development of items that permit identification of the cognitive 
processes employed by individual respondents in solving particular items (Will- 
son, 1994). Analysis of the type of errors made by individuals provides promising 
leads for this purpose (Kulikowich & Alexander, 1994). 


Individual Tests 


n Part Two, we were concerned with the major principles of psychological 

testing. We are now ready to apply these principles to the evaluation of spe- 

cific tests. We now know what questions to ask about each test and where 

to look for the answers. The test manuals and the Mental Measurements Year- 

books are among the principal sources that may be consulted to obtain informa- 
tion regarding any of the tests cited.! 

The purpose of the remaining parts of the book is twofold. One objective is to 
afford an opportunity to observe the application of testing principles to a wide va- 
riety of tests. Another is to acquaint the reader with a few outstanding tests in 
each of the major areas. No attempt will be made to provide a comprehensive sur- 
vey of available tests within any area. Such a survey would be outside the scope of 
this book. Moreover, it would probably be outdated before publication, because of 
the rapidity with which new or revised tests appear. For these reasons, the discus- 
sion will concentrate on a few representative tests in each category, chosen either 
because of their widespread use or because they illustrate important developments 
in testing procedure. We consider ability testing in Part Three, personality testing 
in Part Four, and the applications of testing to a variety of contexts in Part Five. 
Unless otherwise indicated, all data about tests discussed in the text are taken 
from the test manual or technical supplements supplied by the test publishers. 
Readers who want to review a particular test themselves may consult a suggested 
outline for test evaluation in the Study Guide for this text (Urbina, 1997). The 
Testing Standards (AERA, APA, NCME, 1985) provide more detailed guidelines 
for this purpose. 


1The ten-volume Test Critiques series (Keyser & Sweetland, 1984-1994) provides another useful source of infor- 
mation and reviews on hundreds of tests. 
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Traditionally called "intelligence tests," the types of tests to be discussed in this 
and the next chapter are the direct descendants of the original Binet scales. Such 
tests are designed for use in a wide variety of situations and are validated against 
relatively broad criteria (see L. R. Aiken, 1996). They characteristically provide a 
single summary score, such as the traditional IQ, as an index of the examinee's 
general level of performance. Usually, they also yield scores on subtests or groups 
of subtests assessing more narrowly defined aptitudes. Because so many intelli- 
gence tests are validated against measures of academic achievement, they are 
often designated tests of scholastic aptitude or academic intelligence. Intelligence 
tests are frequently employed as preliminary screening instruments, to be followed 
by tests of special aptitudes. This practice is especially prevalent in the testing of 
normal adolescents or adults for educational and occupational counseling, per- 
sonnel selection, and similar purposes. Another common use of general intelli- 
gence tests is to be found in clinical testing, especially in the identification and 
classification of mentally retarded persons. For clinical purposes, individually ad- 
ministered tests are generally employed. Among the most widely used individual 
(as contrasted to group) intelligence tests are the Stanford-Binet and Wechsler 
scales discussed in this chapter. Because the Stanford-Binet is the first test to be 
covered in this text, it is discussed more fully than are other tests throughout the 
book. This is done in order to illustrate at the outset the kinds of information to 
be considered in evaluating a test. It should be noted, however, that the discus- 
sions of particular tests throughout this book are not to be regarded as test re- 
views, such as are provided in sources like the Mental Measurements Yearbooks. For 
the purposes of this text, attention is usually focused on special contributions of a 
particular test or on the features that distinguish it from other tests. 


STANFORD-BINET INTELLIGENCE SCALE 


Evolution of the Scales. The original Binet-Simon scales, published in 
France in 1905, 1908, and 1911, have already been described briefly in chapter 2. 
It will be recalled that among the several translations and adaptations of the early 
Binet tests that appeared in the United States, the most viable was the Stanford- 
Binet The first Stanford revision of the Binet-Simon scales, prepared by Terman 
and his associates at Stanford University, was published in 1916 (Terman, 1916). 
This revision introduced so many changes and additions as to represent virtually a 
new test. Over one third of the items were new, and a number of old items were 
revised, reallocated to different age levels, or discarded. The entire scale was 
restandardized on an American sample of approximately 1,000 children and 400 
adults, Detailed instructions for administering and scoring were provided, and the 


*See Flanagan, Genshaft, and Harrison’s (1997) edited volume on Contemporary Intellectual Assessment for an 
excellent overview of many of the topics discussed in Part Three of this text. 

3Detailed accounts of the Binet-Simon scales and of the development, use, and clinical interpretation of the 
Stanford-Binet can be found in Sattler (1982, 1988). 
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IQ was employed for the first time in any psychological test. The second Stanford 
revision, appearing in 1937, consisted of two equivalent forms, L and M (Terman 
& Merrill, 1937). In this revision, the scale was once again greatly expanded and 
completely restandardized on a new sample of the U.S. population. Despite serious 
efforts to obtain a cross section of the population, however, the sample of 3,184 
examinees was somewhat higher than the U.S. population in socioeconomic level, 
contained an excess of urban cases, and included only native-born Whites. 

A third revision, published in 1960, provided a single form (L-M) incorporating 
the best items from the two 1937 forms (Terman & Merrill, 1960). In preparing 
the 1960 Stanford-Binet, the authors were faced with a common dilemma of psy- 
chological testing. On the one hand, frequent revisions are desirable in order to 
profit from technical advances in test construction and from prior experience in 
the use of the test, as well as to keep test content up to date. The last-named con- 
sideration is especially important for information items and for pictorial material, 
which may be affected by changing fashions in dress, household appliances, cars, 
and other common articles. The use of obsolete test content may seriously under- 
mine rapport and may alter the difficulty level of items. On the other hand, revi- 
sion may render much of the accumulated data inapplicable to the new form. Tests 
that have been widely used for many years have acquired a rich body of interpre- 
tive material, which should be carefully weighed against the need for revision. For 
these reasons, the authors of the Stanford-Binet chose to condense the two earlier 
forms into one, thereby steering a course between the twin hazards of obsolescence 
and discontinuity. The loss of a parallel form was not too great a price to pay for ac- 
complishing this purpose. By 1960, there was less need for an alternate form than 
there had been in 1937, when no other well-constructed individual intelligence 
scale was available. The 1960 Stanford-Binet did not involve a restandardization of 
the normative scale. New samples were utilized only to identify changes in item 
difficulty over the intervening period. Consequently, mental ages and IQs on the 
1960 Form L-M were still expresseed in terms of the 1937 normative sample. 

The next stage was the 1972 restandardization of Form L-M (Terman & Mer- 
rill, 1973, Pt. 4). This time the test content remained virtually unchanged, but 
the norms were derived from a new sample of approximately 2,100 cases tested 
during the 1971-1972 academic year. In comparison with the 1937 norms, the 
1972 norms were based on a more representative sample, as well as being updated 
and hence reflecting any effects of intervening cultural changes on test perfor- 
mance. It is interesting to note that the later norms showed some improvement in 
test performance at all ages. The improvement was substantial at the preschool 
ages, averaging about 10 IQ points The test authors attributed this improvement 
to the impact of the mass media on young children and the increasing literacy 
and educational level of parents, among other cultural changes. There was a 
smaller but clearly discernible improvement at ages 15 and over, which, as the au- 
thors suggested may have been associated with the larger proportion of students 
who continued their education through high school in the 1970s than in the 
1930s. On the basis of both cross-sectional and longitudinal comparisons, R. L. 
Thorndike (1977) explored these normative changes further and proposed still 
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other contributing factors, including the introduction of special television pro- 
grams designed to stimulate intellectual development in pre-school-age children. 

Rising test norms from the 1930s or 1940s to the 1970s have also been found in 
other tests of general intellectual level (Flynn, 1984, 1987). From the standpoint 
of the test user, an important implication of such findings is that individuals or 
groups examined with the earlier and later forms will appear to decline in ability 
because their performance is evaluated against a higher standard in the later form. 
The examiner should be aware of this possible artifact in interpreting scores. 


The Fourth Edition Stanford-Binet (SB-IV): General Description. 
The current edition of this well-established scale represents its most extensive 
revision (Delaney & Hopkins, 1987; Thorndike, Hagen, & Sattler, 1986a, 
1986b). While retaining the chief advantages of the earlier editions as an 
individually administered clinical instrument, this revision reflects intervening 
developments in both theoretical conceptualizations of intellectual functions and 
methodology of test construction. Continuity with the earlier editions was 
maintained in part by retaining many of the item types from the earlier forms. 
Even more important is the retention of the adaptive testing procedure, whereby 
each individual takes only those items whose difficulty is appropriate for her or his 
demonstrated performance level. 

At the same time, content coverage has been extensively broadened beyond 
the predominantly verbal focus of the earlier forms, to give more representative 
coverage of quantitative, spatial, and short-term memory tasks. Moreover, each 
item type is employed over as wide an age range as possible, thereby enabling 
more nearly comparable assessment across age levels. The current instrument is 
designed for use from the age of two years to the adult level. 


Administration and Scoring. -The standard materials required to administer 
the Stanford-Binet are shown in Figure 8-1. These include essentially four 
booklets of printed cards for flip-over presentation of test items; test objects such 
as blocks, form board, multicolored and variously shaped beads, and a large 
picture of a unisex and multi-ethnic doll; a record booklet for the examiner to 
record and score responses; and a guide for administering and scoring the scale. 
In common with most individual intelligence tests, the Stanford-Binet re- 
quires a highly trained examiner. Special training and experience are needed for 
administration, scoring, and interpretation of results. Considerable familiarity 
and practice with the scale are demanded for a smooth performance. Hesitation 
and fumbling may be ruinous to rapport, especially with a young test taker. Minor, 
inadvertent alterations in wording may alter the difficulty of items. A further 
complication arises from the fact that items must be scored as they are adminis- 
tered, since the subsequent course of the examination depends on the individual's 
performance on previously administered items. i "9b 
Over the decades, clinicians have come to regard the Stanford-Binet and similar 
individual scales not only as standardized tests but also as clinical interviews. The 
very features that make these scales difficult to administer also create opportunities 
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Figure 8-1 Test Materials Used in Administering the Stanford-Binet Intelligence 
Scale: Fourth Edition. 


(Copyright O 1986 by the Riverside Publishing Company. Reproduced by permission of the publisher.) 


for interaction between examiner and examinee and provide other sources of clues 
for the experienced clinician. The Stanford-Binet and other tests described in this 
chapter make it possible to observe the respondent's work methods, problem-solv- 
ing approaches, and other qualitative aspects of performance. The examiner may 
also have an opportunity to judge certain emotional and motivational characteris- 
tics, such as ability to concentrate, activity level, self-confidence, and persistence. 
Any qualitative observations made in the course of administering individual scales 
should, of course, be clearly recognized as such and ought not to be interpreted in 
the same way as objective test scores. The value of such qualitative observations de- 
pends largely on the skill, experience, and psychological sophistication of the exam- 
iner, as well as on her or his awareness of the pitfalls and limitations inherent in this 
type of observation. 

Unlike the age grouping of earlier editions, the SB-IV items of each type are 
now placed in separate tests in increasing order of difficulty. There are 15 tests, 
chosen to represent four major cognitive areas: verbal reasoning, abstract/visual 
reasoning, quantitative reasoning, and short-term memory (see Fig. 8-2). Al- 
though grouped under these four headings for scoring purposes, the 15 tests are 
administered in a mixed sequence, in order to keep up the interest and attention 
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of the test taker. The difficulty range of 6 of the tests spans the entire age range of 
the scale. Because of the nature of the tasks, the remaining 9 tests either begin at 
a higher level or end at a lower level, as can be seen in Figure 8-2. 
Administration of the Fourth Edition Stanford-Binet involves a two-stage 
process. In the first stage, the examiner gives the Vocabulary Test, which serves as 
a routing test to select the entry level for all remaining tests. Where to begin on 
the Vocabulary Test depends solely on the individual's chronological age. For all 
other tests, the entry level is found from a chart that combines both Vocabulary 
score and chronological age. In the second stage of testing, the examiner must 


Age 


Verbal Reasoning 
Vocabulary ..... 
Comprehension . 
Absurdities .......+ 5 
Verbal Relations. en 


Quantitative Reasoning 
Quantitative .... 
Number Series . . 
Equation Building . . . 


Abstract/Visual Reasoning 

Pattern Analysis . - 
Copying ..... 
Matrices........ 
Paper Folding and Cutting. . . 


Short-Term Memory 
Bead Memory. en 
Memory for Sentences... .. . 
Memory for Digits....-..--- 
Memory for Objects .......- 


Figure 8-2 Age Span of the Fifteen Tests of the Stanford-Binet: Fourth Edition. 
Note about gray areas: For the nine tests designed for restricted age ranges, some members 
of the standardization sample falling outside these ages took the test because of unusually 
high or low performance on the routing test. Their scores were used in estimating the 
performance of the entire age sample for normative tables, but these estimates are 
included with a special caution about their use. For details, see Guide (Thorndike et al., 
1986a, p. 7) and Technical Manual (Thorndike et al., 1986b, p. 30). 

(Adapted from The Stanford-Binet Intelligence Scale: Fourth Edition, Guide for administering and scoring, p. 
Copyright € 1986 by the Riverside Publishing Company. Reproduced by permission of the publisher.) 
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establish a basal level and a ceiling level for each test, in terms of the individual's ac- 
tual performance. The basal level is reached when four items on two consecutive 
levels are passed. If this does not occur at the entry level, testing continues down- 
ward until a basal level is reached. The ceiling level is reached when three out of 
four or all four items on two consecutive levels are failed. This is where testing 
with that particular test is discontinued for the individual. 

As each item is administered, the examiner records the score on the record 
booklet. The raw score on each test is found by entering the item number of the 
highest item administered and subtracting from it the total number of attempted 
items that were failed. Eleven tests also have sample items, which are never in- 
cluded in the score but serve only for test familiarization. For most tests, cach 
item has only one correct answer, available to the examiner in the item books and 
on the record booklet. All items are either passed or failed, in accordance with 
specified standards. Five tests call for free responses and hence require the use of 
expanded scoring n provided I the Me M ambiguous responses 
require further query by the examiner, covered in the scoring guide. 

Although the entire scale contains 15 tests, no one individual takes all these 


tests, because some are suitable only within limited age ranges. In general, the 
complete battery includes from 8 to 13 tests, depending on the test taker's age and 
her or his performance on the routing test. Administration time for the full bat- 
ro range from 30 to 90 minutes, but less experienced examiners 
ike longer. The scale can usually be administered in one session, 
possibly with a few minutes’ intermission between tests. For some purposes, cer- 
tain abbreviated batteries are suggested in the Guide; these require less testing 
time but focus on those tests best suited for the particular testing purpose. They 
include a 6-test general-purpose abbreviated battery and a 4-test quick screening 
battery; both contain at least one test in each of the four cognitive areas. Also 
are three batteries for assessing students for gifted programs, appropriate 

for each of three age levels; and three batteries for students having learning pr jb- 
lems in school, also corresponding to three age levels. Standard procedures for 
levels and for administering and scoring are followed in all these abbrevi- 


d batteries. The Examiner's (Delafiey & Hopkins, 1987) clarifies 
»cedural issues the administration and scoring of the SB-IV 
with different of examinees. - 


Standardization and Norms. The standardization sample consisted of 
slightly over 5,000 cases between the ages of 2 and 23 years, tested in 47 states 
(including Alaska and Hawaii) and the District of Columbia. The sample was 
stratified to match closely the proportions in the 1980 U.S. Census in geographic 
region, community size, ethnic group, and sex. In addition, socioeconomic status 
was checked in terms of parental occupational and educational levels. The 
results revealed some overrepresentation at the upper and underrepresentation 
at the lower levels. These discrepancies were adjusted through differential 


‘+The five tests in question are Vocabulary, Comprehension, Absurdities, Copying, and Verbal Relations. 
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weighting of frequencies. in. computing score values in the normative tables. 
Thus, each individual from a higher socioeconomic background counted as a 
fraction of a case, whereas an individual from a lower socioeconomic background 
counted as more than one case; ba leg 

The normative tables are used to convert raw scores on each of the 15 tests to 
Standard Age Scores (SAS)? These are normalized standard scores with a mean 
of 50 and an SD of 8 within each age group. Normative tables are provided at 
4-month intervals between the ages of 2 and 5 years, at 6-month intervals be- 
tween 6 and 10 years, and at l-year intervals between 11 and 17; there is a single 
normative table for the age level 18 to 23. The record booklet provides a chart for 
plotting a profile of the test taker's SAS performance op each test administered, 

Standard Age Scores are also available for each of the four cognitive areas an 
for composite performance on the entire scale. The area and composite SASs are 
found from the SASs of the tests administered to each individual, by referring to 
appropriate normative. tables. These SASs are also normalized standard scores, 
but with a mean of 100 and an SD of 16. They are thus expressed in the same 
units as the deviation IQs of the earlier editions of the Stanford-Binet. However, 
the use of the term “IQ” has now been completely abandoned. For special 
purposes, there are provisions for the examiner to find SASs for any desired com- 
bination of two or more area scores ("partial composites"), For example, the com- 
bination of verbal and quantitative reasoning corresponds closely to “scholastic 
aptitude” and may be of particular interest in connection with the assessment of 
academic achievement or readiness. ` 


levels. 

Supplementary data on test-retest reliability were obtained on 57 5-year-old 
preschoolers em 55 8-year-old schoolchildren retested after intervals of 2 to 8 
months. In general, stability was good for the composite score, with reliability co- 
efficients of 91 and .90 for the two groups. While the Verbal Reasoning area score 
yielded reliabilities in the high .80s, other area scores and separate tests Se 
considerable fluctuation. It is difficult to interpret such findings, because of 


Tables ae in Thomdike et all 19860, pp. 183-188. A few SAS values, based on fewer than 100 caet weee DL 


mated statistically for the tire age cohort and are shaded in the tables Such scores occurred when 
performed unusually high or low on the routing test relative co their age (Thorndike et al 1986b, pp. 29-30). 
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possible influence of restricted ranges on some tests and an appreciable practice 
effect, which could vary substantially across individuals. 

In addition to the reliability coefficients, both the Guide and the Technical 
Manual give the standard error of measurement (SEM) within each age level for 
each test, area score, and composite score. Such SEMs are needed to evaluate 
individual scores and to interpret interscore differences for profile analysis. The 
overall composite SAS (M = 100, SD 7:16) has SEMs of about 2 to 3 points. For 
example, if we take 2.5 as an approximate intermediate value, this means that the 
chances are 2 to 1 that an individual's *true" composite score differs by 2.5 points 
or less from her or his obtained score; and the chances are 95 out of 100 that it 
varies by no more than 5 points (2.5 x 1:96 = 4.90). 

The Stanford-Binet's Examiners Handbook (Delaney & Hopkins, 1987) pre- 
sents an interpretive framework that seeks to generate and cross-validate hy- 
potheses derived from the quantitative and qualitative data gathered from the 
battery. Quantitative analysis follows a model first suggested by E B. Davis (1959) 
and applied by Kaufman (1979, 1994) and others to the Wechsler scales. It con- 
sists essentially of comparisons between and among the composite score and the 
four area scores (see Fig. 8-2), evaluated for statistical significance of differences 
in terms of the SEM. The frequency of the obtained differences is also compared 
with base rate data from the standardization sample. In addition, intraindividual 
strengths and weaknesses on the abilities tapped by each test can be assessed sys- 
tematically by comparing a person's average performance on the composite and 
area scores and the scores on the individual tests. The handbook provides the in- 
formation necessary to carry out these profile analyses, as well as four complete 
examples of their application; it should prove to be of great value to both begin- 
ning and experienced examiners. 


Validity. In keeping with current conceptions of test validation, the develop- 
mient of the Fourth Edition Stanford-Binet followed multiple approaches in iden- 
tifying and defining appropriate constructs to be included. The initial choice of 
constructs was guided by a consideration of the available research literature on 
the nature and measurement of intelligence (R. L. Thorndike et al., 1986b, chap. 
1). The experience with the earlier editions of the Stanford-Binet itself, with the 
strengths and weaknesses it revealed, further guided the test construction plans 
and decisions. For example, the separation of item types into reliable subtests was 
a needed substitute for the traditional clinical practice of informal pattern 
analysis through subjective item groupings. 

Following initial choice and preliminary definitions of constructs to be assessed 
in the SB-IV, items were identified and new items developed to fit the construct 
definitions. The item pool was subjected to extensive and statistically sophisti- 
cated procedures of item analysis, including both judgmental and statistical assess- 
ment of item bias (R. L. Thorndike et al., 1986b, chap. 2). The final version of the 
scale resulting from several preliminary tryouts and field trials was administered 
to the standardization sample and was further investigated for three principal types 
of validation data: (1) intercorrelation and factor analysis of scores, (2) correla- 
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tion with other intelligence tests, and (3) comparison of performance of previ- 
ously identified exceptional groups (Thorndike et al., 1986b, chap. 6). 

First, for the entire standardization’ sample, intercorrelations were found 
among all test, area, and composite scores, within each age level. The median 
correlations across all ages were used in a confirmatory factor analysis. The princi- 
pal object of this analysis was to test the hypothesis of a general factor to account 
for the correlations among tests from different areas, and group factors to account 
for residual correlations within each area. Similar factor analyses were conducted 
with median correlations within each of three age groups (2 to 6, 7 to 11, and 12 
to 18-23). 

The results of all the factor analyses revealed substantial loadings of a general 
factor in all tests, thereby justifying, the use. of an overall, composite score. For 
three of the four areas, group factors accounted for a considerable proportion of 
the residual common. variance within the area. The exception was the Ab- 
stract/Visual Reasoning Area, where a high degree of specificity was found among 
the four tests. We might speculate that the failure to find clear-cut evidence of a 
group factor in this area could be associated with the cumulative effects of a 
school curriculum that is not so highly organized with regard to spatial-perceptual 
content as it is with regard to verbal and numerical contents. The everyday per- 
sonal experiences that contribute to the development of spatial-perceptual abili- 
ties. are not systematically organized into “courses” or content areas as are 
academic experiences. Hence they are less likely to foster common patterns of re- 
lationships across individuals (Anastasi, 1970, 1986b). 

A review of the results of factor analyses reported in the test manual, as well as 
those conducted independently by other investigators with the SB-IV standard- 
ization data, supported use of a composite score as a measure of general intellec- 
tual ability (R..M. Thorndike, 1990). However, investigators disagree regarding 
the number and. nature of narrower factors (see also McCallum, 1990). What 
complicates this situation is the fact that since the SB-IV consists of different sets 
of tests at different ages, the raw data for factor analyses (i.e., the correlations be- 
tween test scores) differ correspondingly. This results in different types and num- 
bers of factors—anywhere between two and. four—emerging at different age 
levels. The disarray is compounded by the diversity of ways of applying factor 
analysis in different investigations. In general, however, the four-factor model 
postulated in the development of the SB-IV is more closely approximated with 
increasing age. This is especially, true when confirmatory, -as opposed. to. ex- 
ploratory, factor analyses are used. aly 

A second source of validation data was based on a set of studies with groups 
that had taken the Fourth Edition Stanford-Binet and one other intelligence test, 
including Form L-M of the Stanford-Binet itself. The groups included school- 


children enrolled in regular classes and designated as “non-exceptional” by their 
|” groups in special programs for gifted, learning 


schools, and three “exceptional 2 
disabled, and mentally retarded students. In the non-exceptional sample, the 


MERERI 
ÜThe other tests were the WISC-R, WAIS-R, WPPSI, and K-ABC, all discussed later in this chapter. 
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correlation of the deviation IQ on the earlier Stanford-Binet (Form L-M) with 
the composite score on the Fourth Edition was .81; the next highest correlation 
was with the Verbal Reasoning Area (.76), and the lowest was with the Abstract/ 
Visual Reasoning Area (.56), as would be expectéd from the similarities and dif- 
ferences in the content of the two editions. For all groups the correlations be- 
tween SB-IV composite and area scores with total or partial scores on the other 
intelligence tests were generally consistent with the hypotheses regarding the 
constructs tested. At the same time, a detailed examination of all the correlations 
found between particular scores on the Fourth Edition and particular scores on 
other intelligence tests contributes to a firmer understanding of the constructs 
now measured by the Stanford-Binet. 

A third set of special studies, using exceptional samples; demonstrated that the 
SB-IV can properly identify the performance level of gifted, learning disabled, 
and mentally retarded test takers. The gifted sample obtained significantly higher 
means than the standardization sample in the composite score and in each of the 
four area scores. The means of both learning disabled and mentally retarded stu- 
dents’ composite and area scores were significantly lower than those of the stan- 
dardization sample, and the means of the mentally retarded examinees were 
significantly lower than those of the learning disabled examinées. It should be 
noted that in all studies on exceptional groups, the participants had been previ- 
ously identified on the basis of either tests or other performance indices, and in no 
case had the SB-IV itself been included in such identification. 

A subsequent review of validity research on the SB-IV (Laurent, Swerdlik, & 
pee 1992) concludes that it seems to be at least as good a measure of general 
intellectual ability as others currently available, that it correlates well with mea- 
sures of achievement, and that it can distinguish between mentally retarded, 
gifted, and neurologically impaired persons. The reviewers suggest that the SB-IV 
may be the instrument of choice for evaluating gifted children because of the high 
ceiling provided by the age range of the test; on the other hand, they comment 
that the test lacks floor items that are low enough to diagnose mental retardation 
among the youngest test takers. 

The research needed to strengthen the interpretive meaning of various test 
scores and combinations of scores on the SB-IV has been accumulating rapidly. In 
addition, comprehensive guidance for its use is available from Sattler (1988), 
Glutting and Kaplan (1990), and Kamphaus (1993). The present revision of the 
Stanford-Binet constitutes a genuine advance in the construction of the scale. 
The Fourth Edition affords examiners the flexibility to assess separate abilities ap- 
propriate for specific testing purposes. It also agrees more closely with current the- 
oretical and research findings on the nature of intelligence (see chap. 11). 


THE WECHSLER SCALES 


The intelligence scales developed by David Wechsler include several successive 
editions of three scales, one designed for adults, one for school-age children, and 
one for preschool children. Besides their use as measures of general intelligence, 
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the Wechsler scales have been investigated as a possible aid in psychiatric diagno- 
sis. Beginning with the observation that brain damage, psychotic deterioration, 
and emotional difficulties may affect some intellectual functions more than oth- 
ers, Wechsler and other clinical psychologists argued that an analysis of the indi- 
vidual’s relative performance on different subtests should reveal specific 
psychiatric disorders. The problems and results pertaining to such a profile analy- 
sis of the Wechsler scales are analyzed in chapter 17, as an example of the clinical 
use of tests. 

The interest aroused by the Wechsler scales and the extent of their use are at- 
tested by several thousand publications appearing to date about these scales. In 
addition to the usual test reviews in the Mental Measurements Yearbooks, research 
pertaining to the Wechsler scales has been surveyed periodically in journals 
(Guertin, Frank, & Rabin, 1956; Guertin, Ladd, Frank, Rabin, & Hiester, 1966; 
Guertin, Ladd, Frank, Rabin, & Hiester, 1971; Guertin, Rabin, Frank, & Ladd, 
1962; T. D. Hill, Reddon, & Jackson, 1985; Littell, 1960; Rabin & Guertin, 1951; 
LL. Zimmerman & Woo-Sam, 1972) and has been summarized in several books 
(e.g., Forster & Matarazzo, 1990; Gyurke, 1991; Kamphaus, 1993; Kaufman, 
1979, 1990, 1994; Sattler, 1988, 1992). 


Antecedents and Evolution of the Wechsler Intelligence Scales. The 
first form of the Wechsler scales, known as the Wechsler-Bellevue Intelligence 
Scale, was published in 1939. One of the primary objectives in its preparation was 
to provide an intelligence test suitable for adults. In first presenting this scale, 
Wechsler (1939) pointed out that previously available intelligence tests had been 
designed primarily for schoolchildren and had been adapted for adult. use by 
adding more difficult items of the same kinds. The content of such tests was often 
of little interest to adults. Unless the test items have a certain minimum of face 
validity, rapport cannot be properly established with. adult test takers. Many 
intelligence test items, written with special reference to the daily activities of the 
schoolchild, clearly lack face validity for most adults. 

The overemphasis on speed in most tests also tends to handicap the older per- 
son. Similarly, Wechsler believed that relatively routine manipulation of words 
received undue weight in the traditional intelligence test. He likewise called at- 
tention to the inapplicability of mental age norms to adults and pointed out that 
few adults had previously been included in the standardization samples for indi- 
vidual intelligence tests. i 

It was in Kë to meet these various objections that the original Wechsler- 
Bellevue was developed. In form and — ds vn ja S deutet e all 

i i scales, each of whic s, in turn, e 
the subsequent Wechsler intelligence Re coger 
gence Scale for Children (WISC) was prepared as a downward extension of the 
Wechsler-Bellevue (Seashore, Wesman, & Doppelt, 1950). Many items were 
taken directly from the adult test, and easier items of the same type were added to 
each subtest. The Wechsler-Bellevue itself was supplanted in 1955 by the Wech- 
sler Adult Intelligence Scale (WAIS), which corrected some of the earlier scale B 
technical deficiencies with regard to size and representativeness of the normative 
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sample and reliability of the subtests. The baby of the Wechsler series, published 
in 1967, is the Wechsler Preschool and Primary Scale of Intelligence (WPPSI), 
originally designed for ages 4 to 6% years as a downward extension of the WISC, 
which was designed for ages 5 to 15. 

The development of the WISC was somewhat paradoxical, since Wechsler em- 
barked upon his original enterprise partly because of the need for an adult scale 
that would not be a mere upward extension of available children’s scales. The first 
edition of the WISC was, in fact, criticized because its content was not suffi- 
ciently child-oriented. In the revised edition (WISC-R), published in 1974 and 
designed for 6- to 16-year-olds, special efforts were made to replace or modify 
adult-oriented items so as to bring their content closer to common childhood ex- 
periences. In the Arithmetic subtest, for instance, “cigars” was changed to “candy 
bars." Other changes included the elimination of items that might be differen- 
tially familiar to particular groups of children, and the inclusion of more female 
and Black persons in the pictorial content of the subtests. Several of the subtests 
were lengthened in order to increase reliability. Improvements were also intro- 
duced in administration and scoring procedures. 


Description of the Scales. By now, each of the three Wechsler scales has 
gone through one or more revisions. The current versions, published under the 
name of David Wechsler even after his death in 1981, are the Wechsler Adult 
Intelligence Scale-Revised (WAIS-R— Wechsler, 1981), which covers the age 
span of 16 to 74 years; the Wechsler Intelligence Scale for Children- Third 
Edition (WISC-III—Wechsler, 1991), intended for children aged 6 years to 16 
years and. 11 months; and the Wechsler Preschool and Primary Scale of 
Intelligence-Revised (WPPSI-R—Wechsler, 1989), which now covers the range 
of 3 years to 7 years and 3 months. The third edition of the WAIS, currently 
under development, is anticipated for the late 1990s. 

The WAIS-R, WISC-III, and WPPSI-R share many features, including their 
basic organization into Verbal and Performance Scales, each of which consists of 
a minimum of five subtests (and a maximum of seven) and yields separate devia- 
tion IQs. The individual scores on all 10 of the regularly administered subtests 
(11 for the WAIS-R) are combined into a Full Scale IQ which has a mean of 100 
and an SD of 15, as do the Verbal and Performance IQs: Of the 17 different kinds 
of subtests used in the WAIS-R, the WISC-III, and the WPPSLR, eight (5 verbal 
and 3 performance subtests) are common to all three scales. In administering the 
scales, the verbal and performance subtests are alternated and given in a predeter- 
mined sequence that varies with each scale. 

The Information subtest is the first verbal subtest to be administered in all three 
scales and serves as a good rapport builder. Efforts have been made to avoid special- 
ized knowledge. The first items are easy enough to be passed by the vast majority of 
examinees, unless they are mentally retarded or have reality orientation problems. 
In such cases the examiner may quickly decide to discontinue the testing. The ques- 
tions in the WAIS-R and WISC-III versions of Information cover facts that most 
persons in the United States would have had a chance to learn, such as "What 
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month comes right before December?” or "Who was Mark Twain?” The WPPSI-R 
has similar questions, albeit at a lower difficulty level. It actually begins with a few 
items presented in picture form that require only pointing to the correct response. 
For example, upon presentation of a picture of several household objects, the child 
might be asked which one is used for cleaning. The Arithmetic subtest is another 
verbal measure that illustrates the wide range of difficulty across the Wechsler 
scales. The easiest WPPSI-R Arithmetic items require pointing to the one object 
pictured in an array that illustrates a quantitative concept (such as “smallest” or 
“more”). More complex items may involve counting or solving arithmetic problems 
the most difficult of which may require a good understanding of fractions. 

The performance subtests of the Wechsler scales typically require the manipu- 
lation of various objects, such as puzzles and blocks, or the visual scanning of 
printed materials, like pictures or symbols. They all place time limits on the test 
taker, who in most cases is also given bonus points for speed. In the Verbal Scale, 
by contrast, only one subtest (Arithmetic) is speeded. Picture Completion is a 
performance subtest shared by all three Wechsler scales; it requires the examinee 
to identify what important part is missing from pictures of common objects or 
scenes. The items for earlier ages rely on basic visual inspection—for example, by 
presenting the picture of an animal with a limb missing. The more difficult items 
require also some deductive reasoning, specific knowledge, or both, in order to 
identify what is missing. Figure 8-3 shows two relatively easy Picture Completion 
items similar to those used in the Wechsler scales. 


Abbreviated Scales. ` Since the publication of the original Wechsler-Bellevue, 
a large number of abbreviated scales or short forms have been proposed for the 
Wechsler scales. The goal of these shortened scales is to reduce the administration 
time substantially while obtaining an estimated Full Scale IQ that can be evaluated 
in terms of the published norms. One way in which these shorter forms are con- 
structed is simply by omitting some of the subtests and prorating scores. Abbreviated 
scales have also been formed by reducing the number of items within subtests. 

The fact that several subtest combinations correlate over .90 with Full Scale 
IQs has encouraged the development and use of abbreviated scales for rapid 
screening purposes. Extensive research has been conducted to identify the most 
effective combinations of two, three, four, and five subtests in predicting Verbal, 
Performance, and Full Scale IQs (Matarazzo; 1972; McCusker, 1994; Sattler, 
1988, 1992). Much of this research has utilized standardization data, but a few 
studies have been conducted. on special populations, such as psychiatric patients 
and mentally retarded persons. ` 

An excessive. amount of energy seems to have been expended in assembling 
and checking short forms of the Wechsler scales. However, questions have been 
raised about the quality of the actual procedures employed in deriving abbrevi- 
ated scales (Silverstein, 1990). For example, the assumption that the original 
Full Scale norms are applicable to prorated total scores on short scales may 
not always be justified. Moreover, many of the important qualitative observa- 
tions made possible by the administration of an individual scale are lost when 


Figure 8-3 Two Picture Completion Items Similar to those Used in the Wechsler 
Preschool and Primary Scale of Intelligence-Revised. 
(Courtesy of The Psychological Corporation.) 
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abbreviated scales are used. Thus, it is probably inadvisable to use such abbrevi- 
ated versions except as rough screening devices. 


Norms and Scoring. The standardization samples of the latest Wechsler 
scales were chosen with special care to ensure their representativeness. Norma- 
tive samples included approximately 2,000 cases for each scale, with equal num- 
bets.of females and males distributed over the appropriate age groupings for each. 
Participants were selected so as to match as closely as possible the proportions in 
the most recent reports available from the U.S. Census at the time of standardiza- 
tion, with respect to such variables as geographic region, race or ethnicity, occu- 
pational level, and educational attainment. In the case of children, the occupa- 
tional level of parents was used. For each subsequent revision, the variables used 
in the stratified sampling plan have changed somewhat in a direction that makes 
the samples more inclusive. For example, the WISC-III category of ethnicity 
included four separate groups (White, Black, Hispanic, and Other), whereas the 
earlier WISC-R had used only two (White or Nonwhite). Moreover, unlike 
earlier scales, the WISC-III sample explicitly included representative numbers of 
students receiving special services in school settings, such as learning disabled 
children and children in gifted programs. 

The popularity of the Wechsler scales, which are currently the most widely 
employed individual intelligence tests, has led to a number of investigations de- 
signed to extend their usefulness. For instance, as part of a series of normative studies 
on older Americans at the Mayo Clinic, a sample of 222 persons aged 56 to 97 were 
examined to provide WAIS-R normative data beyond the highest age group in the 
standardization sample (Ivnik et al., 1992). Another study collected WAIS-R norms 
based on 130 persons over the age of 15 years (Ryan, Paolo, & Brungardt, 1990). 

Raw scores on each of the Wechsler subtests are transformed into standard 
scores with a mean of 10 and an SD of 3. All the subtest scaled scores are thus ex- 
pressed in comparable units. The appropriate Verbal, Performance, and Full Scale 
subtest scores are then added and converted to standard scores, with mean of 100 
and an SD of 15, which are designated “deviation IQs.” The WISC-III also pro- 
vides four additional factor-based index scores, namely, Verbal Comprehension 
(VCI), Perceptual Organization (POI), Freedom from Distractibility (FDI), and 
Processing Speed (PSI). The composition of these index scores closely resembles 
that of the factors typically derived from the earlier WISC-R by many indepen- 
dent investigators. The new index scores are based on combinations of two or 
four subtests and have a mean of 100 and an SD of 15. Each of the three Wechsler 
scales provides the information needed to evaluate a person's performance on 
each subtest and group of subtests in terms of the appropriate age norms. 


Reliability. The Wechsler scales rovide split-half reliability coefficients for 
each age group on each subtest pines: index score, and IQ. For all the Wechsler 


GE f 
"Except those for which split-half reliability is inappropriate (i.e., Digit Symbol, Coding, Animal Pegs, Symbol 


Search, and Digit Span). 
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scales, split-half reliabilities of Full Scale IQs ranged from .90 to .98. For Verbal 
IQs, they ranged from .86 to .97, and for Performance IQs, from .85 to .94. The 
WISC-III factor-based index scores had split-half reliabilities ranging from .80 to 
.95. As would be expected, subtests had lower reliabilities. For the three 
Wechsler scales, subtest reliabilities ranged from .52 to .96, with the vast 
majority above .70. It is particularly important to consider subtest reliabilities 
when evaluating the significance of differences between subtest scores of the 
same individual, as in profile analysis (J. H. Kramer, 1990, 1993; Sattler, 1988, 
1992). The Wechsler manuals also report standard errors of measurement for all 
their scores. For Verbal IQs, such errors vary between 2.50 and 4.98 points; for 
Performance IQs, they range from 3.67 to 4.97; and for Full Scale IQs, they are 
all below 4.00. We may thus conclude, for example, that the chances are roughly 
2:1 that an individual's true Full Scale IQ falls within 4 points of her or his 
obtained Full Scale 1Q. 

Data on the retest reliability (or stability) of Wechsler scale scores have been 
gathered more thoroughly with each revision. Stability coefficients tend to be 
higher for adults than for children. The retest studies invariably show increases of 
between 2 and 13 points in the various IQs from the first to the second test, over 
intervals of 12 days to 9 weeks; Full Scale IQs typically rise between 5 and 7 
points. Such an expected practice effect, although slight, should be taken into ac- 
count when retesting individuals after a short time. 

The WPPSI-R and the WISC-III manuals are the most recent in a series that 
has consistently improved over time. Among their many noteworthy features is 
the inclusion of scorer reliability figures for subtests that require substantial judg- 
ment in scoring. The data suggest that these subtests can be scored with acceptable 
reliability by persons who have been trained and given adequate practice. Another 
innovation in the manuals of these two scales is the provision of fairly extensive 
data concerning intraindividual score differences. In addition to tables showing 
the amount of difference needed for statistical significance, the manuals present 
the frequencies of differences found within the standardization sample. This sort of 
information is of particular value in the clinical use of these tests (see chap. 17). 


Validity. Nowhere has the improvement in the successive Wechsler manuals 
been more pronounced than in the area of validity, In 1981, when the WAIS-R 
was published, the manual itself had no validity data on the instrument other 
than a couple of correlational studies between it and earlier Wechsler scales. 
This was even less validity information than could be found in the WPPSI and 
WISC-R manuals, which at least had correlational data for those scales and 
other non-Wechsler intelligence scales, such as the Stanford-Binet. The limited 
coverage of validity in the older manuals, however, was somewhat compensated 
by the extent of available independent research on the validity of all the 
Wechsler scales? In part, the lack of emphasis on validity information in the 


8See Dean, 1977, 1979, 1980; Gutkin and Reynolds, 1981; G. P. Hollenbeck and Kaufman, 1973; Karnes and 
Brown, 1980; Kaufman, 1975; Kaufman and Hollenbeck, 1974; Leckliter, Matarazzo, and Silverstein, 1986; Sil- 
verstein, 1982a, 1982b; and Waller and Waldman, 1990. 
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Wechsler manuals arose from Wechsler’s conviction that the tasks in his scales 
covered a sufficient range of specific abilities to ensure a valid assessment of 
general intelligence (Wechsler, 1958; Zachary, 1990). 

Wechsler's treatment of validity reflected essentially a content-description ori- 
entation, although it had overtones of a construct-identification approach, with 
little supporting data. For example, correlations between the Wechsler scales and 
other global measures of intelligence, such as the Stanford-Binet, clustered 
around .80. In addition, factor analytic studies, conducted by independent inves- 
tigators over the years, were also remarkably consistent in their findings. Almost 
invariably, a verbal and a perceptual-organizational or nonverbal factor were 
found in all age groups. Typically, additional memory and/or attention factors also 
emerged in the scales for older children and adults. The use of the Wechsler scales 
in occupational and educational selection decisions was justified, to some extent, 
by data that showed differences in the expected direction among various groups. 

In contrast, validity data of all types on the latest Wechsler scales are quite 
abundant. The WPPSI-R and WISC-III manuals devote 21 and 38 pages, respec- 
tively, to a discussion of validity as opposed to two pages on that topic in the 
WAIS-R manual. Construct-related validation data are provided by intercorrela- 
tions among the subtests and factorial analyses of the scores. The average inter- 
correlations between the Verbal and Performance scales, across age groups, for the 
standardization samples are .74 for the WAIS-R, .66 for the WISC-III, and .59 for 
the WPPSLR; these figures suggest the presence of a large general factor that is 
cotroborated by most factor analyses of the scales. 

Factor analytic studies across the nine age groups of the WAIS-R standardiza- 
tion sample repeatedly indicate that a three-factor model accounts best for the 
correlations among the 11 subtests. These factors, which appear to be generaliz- 
able to different types of samples, are Verbal Comprehension, Perceptual Organi- 
zation, and Memory/Freedom from Distractibility (Leckliter et al., 1986; Waller 
& Waldman, 1990). The same type of analyses conducted with the WPPSI-R 
standardization data and described in the manual and elsewhere provide a two- 
factor solution consistent with the organization of the subtests into Verbal and 
Performance scales (Blaha & Wallbrown, 1991; LoBello & Gulgoz, 1991; B. J. 
Stone, Gridley, & Gyurke, 1991). On the other hand, the WISC-III data, which 
from the outset were subjected to a number of exploratory and confirmatory 
analyses described in the manual, seem to be accommodated best by a four-factor 
model, including Verbal Comprehension, Perceptual Organization, Freedom from 
Distractibility, and Processing Speed. These four factors have been incorporated 
into the standard scoring of the WISC-III. 

The WPPSLR and the WISC-III manuals also present validity data from. nu- 
merous studies, albeit with small convenience samples, that correlate the two 
scales with other individually administered tests. In the case of the WISC-III, 
some correlations with group-administered achievement scales and sc 
grades are also cited. Moreover, the diagnostic or criterion-predictive utility of 
the WISC-III and WPPSLR has also been investigated in a series of studies with 
special groups, such as gifted, mentally retarded, learning disabled, and cther 
types of children. 
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Concluding Remarks on the Wechsler Scales 


The successive editions of the three Wechsler scales reflect an increasing level of 
sophistication and experience in test construction, corresponding to the decades 
when they were developed. In comparison with other individually administered 
tests, their principal strengths stem from the size and representativeness of the 
standardization samples, particularly for adult and preschool populations, and the 
technical qualities of their test construction procedures. The treatment of relia- 
bility and validity in the WISC-III manual is especially commendable. The popu- 
larity of the Wechsler scales assures them of a constantly expanding research base 
for the time being. Many ancillary materials, such as computer-assisted interpre- 
tive programs, manuals for training examiners (e.g., Fantuzzo, Blakey, & Gorsuch, 
1989), and interpretive guides (e.g., Kaufman, 1994; Nicholson & Alcorn, 1994; 
Whitworth & Sutton, 1993), are also available to their users. However, some crit- 
ics have noted that even the latest, most improved versions of the Wechsler scales 
may soon become obsolete in light of the current demands for links between as- 
sessment instruments and intervention strategies (Shaw, Swerdlik, & Laurent, 
1993; Sternberg, 1993). In this regard, the weakest feature of all che Wechsler 
scales has been their lack of theoretical grounding, which makes it hard to find a 
coherent basis for interpretation. Furthermore, the composition of the scales seems 
to presume that the ability domains tapped by their subtests across age levels are 
the same because of the superficial similarities among test materials and tasks. This 
presumption may not be justifiable in view of what we now know about develop- 
mental changes in the nature of intelligence across the life span (see chap. 11). 


THE KAUFMAN SCALES 


The Kaufman scales are individually administered clinical instruments designed 
for many of the uses for which such tests as the Stanford-Binet and Wechsler 
scales were developed and have been traditionally employed (Kaufman & Kauf- 
man, 1983a, 1983b, 1990, 1993). Developed in the 1980s and early 1990s, the 
Kaufman scales incorporate recent advances in test construction. The Kaufman 
Assessment Battery for Children (K-ABC— Kaufman & Kaufman, 19832, 1983b) 
and the Kaufman Adolescent and Adult Intelligence Test (KAIT—Kaufman & 
Kaufman, 1993), in particular, represent attempts on the part of their authors— 
both of whom were involved in the development of the WISC-R—to move be- 
yond the atheoretical stance of the older intelligence scales. They sought to 
create instruments that by design would be anchored in evolving theories of intel- 
ligence, would include developmentally appropriate tasks, and would provide use- 
ful information for a variety of assessment situations. 


Kaufman Assessment Battery for Children (K-ABC) 


Nature and Development. The design of the K-ABC began with the for- 
mulation of constructs to be assessed. In line with the general orientation of 
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cognitive psychology, the focus was on information processing. The particular 
approach chosen differentiates between Simultaneous Processing, represented 
by seven subtests, and Sequential Processing, represented by three subtests (J.P. 
Das, 1984; J. P. Das, Kirby, & Jarman, 1975, 1979; J. P. Das & Molloy, 1975; 
Kaufman & Kaufman, 1983b, chap. 2; Luria 1966). The. Simultaneous 
Processing subtests require the synthesis and organization of spatial and visuo- 
perceptual content that can be surveyed as a whole. The Successive Processing 
subtests require serial or temporal arrangement; they utilize verbal, numerical, 
and. visuoperceptual. content, as well as short-term. memory. Several tasks 
presented in the combined Mental Processing Scale are similar to those 
employed in neuropsychological examinations (discussed in chap. 17) and were 
chosen for this reason. 

The battery also includes an Achievement Scale, comprising six subtests. Al- 
though assessing ability in reading, arithmetic, word knowledge, and general in- 
formation, the Achievement subtests were explicitly designed not to measure 
factual knowledge taught in school. They resemble much more closely the tasks 
included in traditional aptitude or intelligence tests than those included in tradi- 
tional educational achievement tests. In the Arithmetic test, for example, the 
child looks at a series of pictures of a family visit to the zoo and responds by 
counting and carrying out simple numerical operations with the objects in each 
picture. Reading comprehension is demonstrated by performing the actions de- 
scribed in each sentence that the child reads. 

The K-ABC was standardized on a national sample of 2,000 children (aged 
2% to 124). Additional groups of Black and White children were tested to de- 
velop sociocultural norms for race and parental education that may be used as a 
supplementary interpretive aid. The design and organization of the K-ABC was 
also intended to accommodate the testing needs of special groups, such as hand- 
icapped children and children from cultural and linguistic minorities, and to aid 
in the diagnosis of learning disabilities (Kamphaus, Kaufman, & Harrison, 
1990). The battery yields four global scores: Sequential Processing, Simultane- 
ous Processing, Mental Processing Composite (combining the first two), and 
Achievement. Each of these is a standard score with a mean of 100 and an SD 
of 15. 


General Evaluation. The K-ABC has many positive features, both technical 
and practical? With it, the popular tendency to label children with a single, eval- 
uative number, such as an IQ, is certainly held in. check through the of 
multiple scores, profile analyses, and diagnostic interpretations, especially as 
ii e in gw 5 and d of the Interpretive Manual (Kaufman & Kaufman, 
1983b). Chapter 6 of this manual provides an excellent illustration of the cycle of 
hypothesis generation and hypothesis testing that is the essence of the clinical 
approach to diagnosis. In their effort to dispel certain popular misconceptions, 


°For reviews and critical discussions, see especially T. L. Miller (1984). See also Anastasi (19842, 1985c), Coff- 
man (1985), Kamphaus (1990), Kline, Snyder, and Castellanos (1996), and Page (1985). 
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moreover, the authors clearly state in the opening chapter of the Interpretive 
Manual (Kaufman & Kaufman, 1983b, pp. 20-24) that the K-ABC is not “a 
measure of innate or immutable abilities,” adding that “all cognitive tasks are 
seen as measures of what the individual has learned.” They rightly caution that, 
like any other test, the K-ABC is not “the complete test battery” but should be 
supplemented by other instruments to meet individual needs. 

Despite the authors’ precautionary statements, however, the use of the term 
“achievement tests” may have been an unfortunate choice; because of the 
prevalent misconceptions about the relation between aptitude and achievement 
tests. À test can be clearly labeled an achievement test when it is closely tied to 
specific, identifiable instructional content, to which the test takers are pre- 
sumed to have been exposed. This is not true of the tests labeled “achievement” 
in the K-ABC, where special efforts were made to dissociate the tests from spe- 
cific information acquired in the classroom. In fact, within the continuum of de- 
veloped abilities, these tests are much closer to the aptitude than to the 
achievement end, a conclusion that is corroborated by subtest intercorrelations. 
There is therefore little justification for using terminology that has acquired ex- 
cess meanings, carries unintended implications, and encourages popular miscon- 
ceptions. 

The explicit statement of a theoretical framework as a guide for task specifica- 
tions and item development in the K-ABC was a welcome innovation consistent 
with good test construction principles. Nevertheless, more than a decade after its 
release, questions remain as to whether the particular theoretical orientation cho- 
sen was the best for the purpose. Specifically, the distinction between simultane- 
ous and sequential processing has been challenged as a basis for understanding 
performance on the K-ABC, and it has been argued that the two sets of subtests 
so designated may also be characterized as verbal and nonverbal reasoning tests 
(J. P. Das, 1984; Goetz & Hall, 1984; A. R. Jensen, 1984; Keith, 1985; Keith & 
Dunbar, 1984). 

On the other hand, the research that has already accumulated on the K-ABC 
suggests that its global scores are similar to those of the WISC-R in terms of their 
predictive validity and the extent to which they measure "general intelligence" 
(Kamphaus, 1990). Because of its lesser reliance on verbal skills, the K-ABC may 
be the measure of choice for children with limited English proficiency or hearing 
impairments. A balanced presentation of the strengths and limitations of this rel- 
atively new instrument can be found in Clinical and Research. Applications of the 
K-ABC (Kamphaus & Reynolds, 1987, chap. 8). 


Kaufman Adolescent and Adult Intelligence Test (KAIT) 


Nature and Development. The KAIT (Kaufman & Kaufman, 1993) was 
designed as a measure of intelligence for ages 11 to 85 years or older. It represents 
an attempt to integrate the theory of fluid and crystallized intelligence articulated 
by Horn and Cattell (1966) with notions about adult intelligence proposed by 
other theorists (Golden, 1981; Luria, 1980; Piaget, 1972). 
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The battery is composed of a Crystallized Scale, which measures concepts ac- 
quired from schooling and acculturation, and a Fluid Scale, which taps the ability 
to solve new problems. The Core Battery consists of three subtests from each 
scale. An Expanded Battery, intended for use with individuals suspected of neuro- 
logical damage, can also be employed by adding any of four specified subtests. The 
KAIT also includes a brief Mental Status test to assess attention and orientation 
in examinees who are too cognitively impaired to take the full battery. 


General Evaluation. .. In terms of its technical qualities, the KAIT appears to 
meet psychometric standards.as well as any of the current generation of major 
intelligence scales. Its normative sampling is adequate, and the reliability and 
validity data reported in the manual appear promising. The battery is relatively 
easy to administer. Moreover, the manual includes information that is quite 
helpful in the administration and scoring of a test (e.g., what to do when someone 
responds in a language other than English). 

Nevertheless, what really distinguishes the KAIT from other adult intelligence 
scales is the care taken to develop and try out its original pool of more than 2,500 
items. The items were designed to appeal to adult test takers. They tend to require 
the kind of problem solving typical of Piaget’s formal operational thought and the 
planning evaluative functions that characterize adult thinking, according to Luria 
(1980) and Golden (1981). As a result, most of them are unusual and interesting. 
Many of the subtests are game-like even in their titles—for example, Famous 
Faces, Mystery Codes, and Double Meanings. Others feature novel tasks, such as 
Rebus Learning. In this subtest, examinees learn the words associated with a par- 
ticular rebus (drawing) and then “read” phrases or sentences made up of rebuses. 
An example is shown in Figure 8-4. The crucial test for the KAIT, as for any new 
instrument, lies in whether its appeal will be sufficient to generate the volume of 
research and applied use with which to build a rich and durable database. 


Kaufman Brief Intelligence Test (K-BIT) 


The Kaufman Brief Intelligence Test (K-BIT—Kaufman & Kaufman, 1990) was 
designed as a quick screening instrument to estimate level of intellectual func- 
tioning. Although it is individually administered, the test is simple and can be 
given by a technician. The K-BIT covers the age range of 4 to 90 years. Normed 
concurrently with the KAIT, it used about 20 percent of the KAIT standardiza- 
tion sample of 2,000. l ls J 

The Cat is not a shortened version of either the K-ABC or the KAIT. It 
consists of one verbal subtest of 45 Expressive Vocabulary items and 37 Defini- 
tions, and one nonverbal subtest of 48 Matrices. The three. scores (verbal, non- 
verbal, and composite) yielded by the K-BIT are expressed in terms of deviation 
IQ units, as are those of the other Kaufman scales. The length of the K-BIT sub- 
tests results in higher reliability coefficients than those typical of short forms of 
other scales. Nevertheless, with regard to the critical issue of correlation between 
its scores and those of full-length scales, the K-BIT does not seem to be an im- 
provement over the short forms of other intelligence tests. 


240 ABILITY TESTING 


EE 


Examiner: “Each of these drawings has a meaning (point 
to each rebus in turn). This means bus; this means 
plane; this means the; this means and." 


Stimulus: 


dA 
DP ety wih el 


Examiner: “Read these drawings.” 


Answer: The plane. The plane and bus. 


Figure 8-4 Example of the Rebus Learning Subtest of the 
Kaufman Adolescent and Adult Intelligence Scale. 


(From Kaufman & Kaufman, 1993, p. 5. Copyright © 1993 by American Guid- 
ance Service, Inc. Reproduced by permission of the publisher.) 


————————— ÁÀÁ—À 


DIFFERENTIAL ABILITY SCALES 


The Differential Ability Scales (DAS) by C. D. Elliott (1990a, 1990b) is a revi- 
sion and extension of the British Ability Scales (BAS), developed in Great 
Britain in the 1970s (Elliott, Murray, & Pearson, 1979). The DAS shares with the 
current versions of the Stanford-Binet and Wechsler scales the common goals of 
classifying persons in terms of their general level of ability and producing profiles 
of their individual intellectual strengths and weaknesses. However, both in its 
procedures and technical qualities, the DAS is unusual and embodies many ad- 
vances in psychometric theory and practice that the other scales do not share. In 
this connection, it is noteworthy that the Preface to the DAS handbook asserts 
that the terms “intelligence” and “IQ” are not part of the DAS vocabulary 
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(Elliott, 1990a, p. vi). Rather, the organization, scoring, and interpretation of the 
scale focus on the precisely defined behaviors that are actually being assessed. 
Such an explicit statement, appearing for the first time in the handbook of a gen- 
eral ability scale, should help to dispel the stereotypes and misconceptions associ- 
ated with the popular use of those terms. 


Description. The DAS was designed primarily to measure specific abilities 
with adequate reliability, so as to aid with the more complex goals of individual 
assessment, namely, differential diagnosis and treatment planning. The choice of 
tasks included in the battery was guided by a theoretical as well as an empirical 
rationale. The theoretical underpinning is eclectic and flexible. It uses a 
hierarchical approach to mental abilities that can accommodate different levels 
of generality and provide a broad base of information from which to derive 
hypotheses about individual examinees. This structure is quite compatible with 
empirical findings about the development of cognitive abilities. Unlike earlier 
scales that forced data onto a theoretical model regardless of fit, the DAS retained 
only those components for which both empirical and theoretical rationales 
converged. 

As shown in Figure 8-5, the DAS consists of 20 subtests organized into three 
major components: (1) core subtests; (2) diagnostic subtests; and (3) achieve- 
ment tests. The titles of these tests and subtests describe the tasks they comprise 
and are largely self-explanatory. Twelve core subtests and five diagnostic subtests 
constitute the cognitive battery, which is itself divided into preschool and school- 
age levels. At the preschool level, there are four core subtests for ages 2:6 to 3:5, 
and six core subtests for ages 3:6 to 5:11.19 At the school-age level (ages 6:0 to 
17:11) there are six core subtests. For each age level, the sum of the core subtests 
is used to obtain the General Conceptual Ability (GCA) score, which for the DAS 
is the overall summary score. The tests designated as the “core subtests” of the 
DAS battery are those that have high loadings on the general factor of the bat- 
tery or “g.” The diagnostic subtests, on the other hand, have low correlations 
with g and do not cluster into group factors; this means that they measure rela- 
tively independent abilities. When diagnostic tests are appropriate, from two to 
five are available depending on age level. Finally, three achievement tests are 
provided and are generally administered to examinees over the age of six years. 

As can be seen in Figure 8-5, some subtests from each of the major compo- 
nents of the DAS may also be given, and properly interpreted, outside the age 
level for which they are normally prescribed. Performance in subtests designated 
as “extended age range” and “out of level” can be compared to norms based on ap- 
propriate age samples gathered as part of the DAS standardization process. Sub- 
tests in the extended age range category may be used as additional. diagnostic 
measures when their content is relevant to the assessment purpose for a given ex- 
aminee. For example, the Block Building subtest can be given to children aged 


104 colon is frequently used to separate years and months when reporting ages; thus, "6" stands for "2 years and 
6 months." This notation is used throughout the book, whenever a series of ages is given. 
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CORE SUBTESTS 
Block Building 

Verbal Comprehension 
Picture Similarities 
Naming Vocabula 

Early Number Concepts 
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Figure 8-5 Organization of the Differential Ability Scales. 


(Adapted from Elliott, 1990b, p. 4. Copyright € 1990 by The Psychological Corporation. Reproduced by per- 
mission of the publisher.) ; 


3:6 to 4:11 to obtain more information on perceptual and fine motor skills than 
the core battery at that age allows. On the other hand, subtests normed for out-of- 
level testing (labeled with an *H" or “L” in Fig. 8-5) are appropriate only for indi- 
viduals of average-to-high (“H”) or average-to-low (“L”) levels of ability. An 
advantage of this feature is that it allows examiners to assess with an unprece- 
dented degree of accuracy the abilities of exceptional individuals who are func- 
tioning at unusually high or low levels for their age. 


Scaling and Norming. A major reason behind the conceptual and technical 
strengths of the DAS is that it benefited from the accumulated research and 
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thinking that were done in developing its ‘predecessor, the BAS. The planning, 
construction, and standardization of the BAS were carried out over a period of 
about two decades during which significant advances in psychometric theory and 
practice were occurring. Thus, the DAS is a new instrument with many up-to- 
date features, yet it reflects knowledge and experience gained from the 1960s 
through the 1980s. 

The standardization of the DAS was exemplary both in terms of the size of the 
sample and the care with which it was assembled. The sample comprised 3,475 
cases, a considerably larger number than usual for an individually administered 
test. It was designed to represent the target population of all noninstitutionalized 
English-proficient persons aged 2:6 to 17:11 living in the United States at the time 
of data collection (1987-1989). Stratification was based primarily op age; sex, 
race/ethnicity, parental education, and geographic region. The population target 
figures were based not just on one demographic variable at a time, as is typically 
the case, but on combined variables. For example, the distribution of parental ed- 
ucation for White females living in the Northeast among the DAS sample ap- 
proximated the corresponding distribution of such females in the population of 
the Northeastern United States. Target figures were calculated by using raw data 
obtained from the Bureau of the Census for the most recent period available at 
the time. Although the standardization sample was representative of the racial 
and ethnic composition of the population (categorized as Black, Hispanic, White, 
and Other), additional protocols for about 300 to 600 Blacks and Hispanics were 
collected and used for bias analysis only. Students in special education classes, 
such as those for children with mild impairments or exceptional talents, were not 
excluded from the norm group, which was meant to be inclusive of all the school- 
age population and not just a “normal” group. ' 

The DAS used a one-parameter model of item response theory (IRT)! to per- 
mit the calibration of difficulty values for each item. As a result, individuals can 
be examined with items best suited to their ability level (i.e., adaptive testing). 
The individual's score is based on both the number and the difficulty level of the 
items completed. These data were placed on a common, nonnormative scale 
that was utilized to convert the raw scores on each subtest’ to ability scores. A 
goodness-of-fit statistic, based on the correspondence between predicted and ob- 
served responses to items, was also used to identify and drop items that did not fit 
the model; this had the effect of producing more homogeneous item sets. 

In the actual administration of the DAS, the adaptive testing strategy is imple- 
mented by means of designated starting points, based on age; decision points, based 
on performance from the starting point to the decision point; and alternative stop- 
ping rules, specific to each subtest. The item sets included between these points 
were empirically identified so as to yield the most efficient balance between relia- 
bility and test length. The chief advantage of the adaptive testing strategy of the 
DAS is the flexibility it permits the examiner to select item subsets appropriate to 
each examinee: However, the fact that one can obtain ability estimates based on 


u 
See chapter 7 for explanation: 
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a common scale of item difficulty, even when different item subsets have been ad- 
ministered, has the added advantage of allowing comparisons of the subtest abil- 
ity scores of different individuals and of a single individual on different occasions. 
This characteristic makes the DAS, and other instruments constructed along the 
same lines, particularly well suited for developmental studies that use longitudinal 
methods or cross-sequential designs. 

Once the ability scores on subtests of the cognitive battery are obtained, they 
can be converted into normalized standard scores with a mean of 50'and an SD of 
10 (T scores) as well as percentile equivalents. Both types of scores are provided 
within each age group. The achievement tests use standard scores with a mean of 
100 and an SD of 15, instead of T scores, and grade-based instead of age-based 
percentiles. Age-equivalent scores can also be obtained for all DAS subtests. For 
the achievement tests, grade equivalents can be obtained as well. These equiva- 
lents indicate the age (or grade) at which the examinee's ability score corresponds 
to the median score. Since the DAS cognitive and achievement measures were 
developed and normed concurrently, the normative comparisons made possible 
by all these score transformations enable users to address a variety of questions ap- 
propriate for the precise investigation of individual problems. 

Core subtest scores on the DAS are added in order to derive the appropriate 
composite score(s) available at different age levels. All composite scores are ex- 
pressed as standard scores with a mean of 100 and an SD of 15. As Figure 8-6 indi- 
cates, for children at the youngest age level (2:6 to 3:5), the only composite available 
is the GCA score; between the ages of 3:6 and 5:11, the DAS provides two cluster 
scores (Verbal and Nonverbal Ability) in addition to the GCA. For all school-age 
examinees (aged 6:0 to 17:11), three clusters can be derived (Verbal, Nonverbal 
Reasoning, and Spatial Ability). Through extrapolations of the relationship between 
raw performance and GCA scores at different ages, it is also possible to extend the 
use of GCA norms to levels of performance well below the standard norms. This 
provision was made to facilitate the assessment of severely retarded individuals who 
are not likely to have been represented in the normative sample. 


Reliability and Validity. The reliability indices of the DAS compare favor- 
ably with those of other intelligence tests. Test-retest reliability coefficients of the 
GCA and cluster scores, with intervals of 2 to 7 weeks between the tests, range 
from .79 to .94. For the subtests, comparable retest-reliability estimates range 
from .38 to .94, with the overwhelming majority between .60 and .90. Interrater 
reliabilities for subtests with open-ended responses that require a significant 
amount of judgment for scoring!” cluster in the mid .90s. 

Internal-consistency reliability for the DAS was assessed through the applica- 
tion of item response theory (IRT). This procedure permits the computation of 
the precise reliabilities and errors of measurement corresponding to each possible 
score on a subtest. The results, which vary widely across the ability spectrum, cor- 
roborate the well-known trend for reliability coefficients to be lower for persons at 


12The subtests in question are Word Definitions, Similarities, Copying, and Recall of Designs. 
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Figure 8-6. Composite Scores of the Cognitive Battery, Differential Ability 
Scales, 

(From Elliott, 1990b, p. 21. Copyright © 1990 by The Psychological Corporation. Reproduced by permission of 
the publisher.) 


the extremes of the score range than for persons near the center. For the DAS, 
the internal consistency reliability estimates, across all appropriate age levels, 
range from .66 to .95 for subtests, from 86 to .94 for cluster scores, and from .89 to 
96 for GCA scores. The DAS handbook also presents some comparisons between 
IRT-based and traditional internal consistency coefficients. In those cases, the re- 
liability figures produced by both methods were in close agreement. 3 

The validity of the DAS has been explored ftom both the internal and exter- 
nal points of view. In terms of internal validity, exploratory and confirmatory fac- 
tor analyses were used to derive the structure of DAS scores (see Fig. 8-6). The 
results of both types of analyses were quite similar and, on the whole, provide fur- 
ther evidence for the finding that abilities become more differentiated with 
increasing age (Anastasi, 1970). A one-factor model, encompassing the four core 
subtests used to generate the GCA score at ages 2:6 to 3:5, fits the data best 
for children in that age group. For children between the ages of 3:6 and 5:11, a 
two-factor solution emerged. In this age range; the nonverbal ability factor was 
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identified mainly by high loadings on subtests like Pattern Construction and Copy- 
ing, whereas the verbal ability factor was defined principally by the Verbal Com- 
prehension and Naming Vocabulary subtests. At ages 6:0 to 17:11, a three factor 
model, comprising verbal, nonverbal reasoning, and spatial abilities, accounted 
best for the intercorrelations among the core subtests. The five diagnostic subtests 
of the DAS were not included in the GCA or cluster scores. These subtests gener- 
ally consist of tasks requiring memory and speed in processing information. The 
fact that the diagnostic subtests have small loadings on the common factor (or 
"eil and a considerable amount of specific variance makes them ideally suited for 
the identification of an individual's strengths and weaknesses. 

Extensive data on the external validity of the DAS are described in the hand- 
book. The major sources of such data are: (1) correlations between DAS com- 
posite and subtest scores and scores on other multiple-ability batteries like the 
Stanford-Binet and Wechsler scales: (2) correlations between DAS subtest 
scores and those of other tests of specific cognitive abilities and academic 
achievement, like the Peabody Picture Vocabulary Test-Revised (PPVT-R) and 
the Woodcock Reading Mastery Tests-Revised (WRMT-R), as well as with 
school grades; and (3) investigations of the DAS score profiles of special popula- 
tions, such as gifted, learning disabled, or mentally retarded individuals. All of 
these sources of data generally support the hierarchical structure of the DAS and 
the comparability of its composite, subtest, and achievement test scores with 
similar measures. The validity of the diagnostic subtests for identifying subgroups 
of children with learning disabilities appears promising but needs to be investi- 
gated further. 


General Evaluation. As has been observed by others (Aylward, 1992; 
Reinehr, 1992), the complexity of the administration and scoring procedures of 
the DAS may hinder its dissemination and use in applied settings. Moreover, the 
DAS is a relatively new and clinically untested instrument that will have to prove 
its value in practice. A further limitation concerns examinees in the lowest and 
highest age groups covered by the DAS (i.e., 2:6 and 17:11), for whom the floor 
and ceiling levels of item difficulty may be insufficient. Nevertheless, the DAS is 
a "state of the art" instrument of its type, as yet unsurpassed in the possibilities 
and advantages it affords to its users. Its hierarchical structure, the diversity of 
abilities it covers, and the reliability with which it assesses them. allow users 
unprecedented flexibility. Specifically, examiners may choose from a wide range 
of tasks those that are most appropriate for the purposes of the evaluation and the 
needs of the person being tested. Another outstanding characteristic of the DAS 
is the excellent quality of its Administration and Scoring Manual. and its 
Introductory and Technical Handbook (Elliott, 1990a, 1990b). The handbook, in 
particular, is quite comprehensive and conveys much information with unusual 
precision, clarity, and conciseness. It should be singularly helpful to prospective 
users, especially those who wish to become better acquainted with the theoretical 
and practical advances in the field of intelligence and its measurement, which the 
DAS so well represents and describes. 
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DAS-NAGLIERI COGNITIVE ASSESSMENT SYSTEM 


Another major new instrument for the individual assessment of cognitive func- 
tioning, published in the late 1990s, is the Das-Naglieri Cognitive Assessment 
System (CAS). This measure, which was under development for over ten years, is 
based on the PASS model of intelligence proposed by its authors (J. P. Das, 
Naglieri, & Kirby, 1994; Naglieri & Das, 1990, 1997a, 1997b). In turn, the model 
is grounded in the theory of cognition and brain organization espoused by the 
Russian neuropsychologist A. R. Luria. 

The tasks of the CAS are designed to measure the basic cognitive functions 
that are involved in learning but presumed to be independent of schooling. These 
include Planning, Attention, and Simultaneous and Successive processing. The 
system employs verbal and nonverbal tests presented through visual and auditory 
sensory channels. The Planning tests require an assessment of the strategies em- 
ployed by the examinee in performing the tasks. The CAS is appropriate for use 
with individuals between the ages of 5:0 and 17:11 and has been specifically de- 
signed to link assessment with intervention. Because of its sound theoretical and 
empirical bases and the careful, large-scale standardization it has undergone, the 
completion of the CAS has been eagerly anticipated by many test users. In fact, 
its tryout version has already undergone published reviews (Lambert, 1990; 
Telzrow, 1990). Based on the preliminary evidence that is available regarding the 
validity of the CAS, it appears that this test will become an important, as well as 
innovative, tool for the assessment of cognitive status. 


ests for Special Populations 


he tests discussed in this chapter include both individual and group scales. 

They were developed primarily for use with persons who cannot be prop- 

erly or adequately examined with traditional instruments, such as the indi- 

vidual scales described in the preceding chapter or the typical group tests to 

be considered in the next chapter. Historically, the kinds of tests surveyed in this 
chapter were designated as performance, nonlanguage, or nonverbal tests. 

Performance tests, on the whole, involve the manipulation of objects, with a 
minimal use of paper and pencil. Nonlanguage tests require no language on the 
part of either the examiner or examinee. The instructions for these tests can be 
given by demonstration, gesture, and pantomime, without the use of oral or writ- 
ten language. A prototype of nonlanguage group tests was the Army Examination 
Beta, which was developed for testing foreign-speaking and illiterate recruits dur- 
ing World War I (Yerkes, 1921). Revisions of this test were subsequently prepared 
for civilian use. For most testing purposes, it is not necessary to eliminate all lan- 
guage from the test administration, since test takers usually have some knowledge 
of a common language. Moreover, short, simple instructions can usually be trans- 
lated or given successively in two languages without appreciably altering the na- 
ture or difficulty of the test. None of these tests, however, requires the test takers 
themselves to use either written or spoken language. 

Still another related category is that of nonverbal tests, more properly desig- 
nated as nonreading tests. Most tests for primary school and preschool children 
fall into this category, as do tests for illiterates and nonreaders at any age level. 
While requiring no reading or writing, these tests make extensive use of oral in- 
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e 
structions and communication on the part of the examiner. Moreover, they fre- 
quently measure verbal comprehension—such as vocabulary recognition and the 
understanding of sentences and short paragraphs—through the use of pictorial 
items which are supplemented with and accompanied by oral instructions. Unlike 
the nonlanguage tests, they would thus be unsuited for individuals who have 
hearing impairments or who do not speak English. ) 

Although the traditional distinctions between performance, nonlanguage, and 
nonverbal tests contribute to an understanding of the purposes that different tests 
may serve, these distinctions have become somewhat blurred as more and more test 
batteries have been developed that cut across these three categories. The combina- 
tion of verbal and performance tests in the Wechsler scales is a classic example. 

In the present chapter, tests have been classified, not in terms of their content 
or administrative procedures, but with reference to their principal uses. Four 
major categories can be recognized from this viewpoint: tests for the infant and 
preschool level; tests employed for a comprehensive assessment of mentally re- 
tarded persons; tests for persons with diverse sensory and motor handicaps; and 
tests designed for use across cultures or subcultures. Such a classification schema 
must remain flexible, however, since several of the tests have proved useful in 
more than one context. This is especially true of some ofithe instruments origi- 
nally designed for cross-cultural testing, which are now more commonly used in 
clinical examinations." I $ 

Finally, although some of the tests covered in this chapter were designed as 
group tests, they are frequently administered individually. A few are widely used 
in clinical testing to supplement the usual type of intelligence tests and thus pro- 
vide a fuller picture of the individual's intellectual functioning. Several permit 
the kind of qualitative observations associated with individual testing and may re- 
quire considerable clinical sophistication to make a detailed interpretation of test 
performance possible. On the whole; they are closer to the individual tests illus- 
trated in chapter 8 than to the group tests surveyed in chapter 10. 


INFANT AND PRESCHOOL TESTING 


All tests designed for infants and preschool children require individual adminis- 
tration. Some kindergarten children can be tested in small groups with the types 
of tests constructed for the primary grades. In general, however, group tests are 
not applicable until the child has reached school age. Most tests for children 
below the age of six are either performance or oral tests. A few involve rudimen- 
tary manipulation of paper and pencil. ; 
It is Gen to subdivide the first five years of life into the infant period and 
the preschool period. The first extends from birth to the age of approximately 18 
me 


'For further information, evaluation, and references pertaining to many of the types of tests illustrated in this 
chapter, see Sattler (1988, chaps. 12, 14, and 15). 


250 ABILITY TESTING 


months; the second from 18 to 60 months. From the viewpoint of test administra- 
tion, it should be noted that the infant must be tested while lying down, sup- 
ported on a person's lap, or otherwise held, as illustrated later in this chapter. 
Speech is of little use in giving test instructions, although the child's own lan- 
guage development provides relevant data. Many of the tests deal with sensori- 
motor development, as demonstrated by the infant's ability to lift her or his head, 
turn over, reach for and grasp objects, and follow a moving object with the eyes. 
The preschool child, on the other hand, can walk, sit at a table, use her or his 
hands in manipulating test objects, and communicate by language. At the 
preschool level, the child is also much more responsive to the examiner as a per- 
son, whereas for the infant the examiner serves primarily as a means of providing 
stimulus objects. Preschool testing is a more highly interpersonal process—a fea- 
ture that augments both the opportunities and the difficulties presented by the 
test situation. 

The proper psychological examination of young children requires coverage 
of a broad spectrum of behavior, including social and emotional traits as well as 
motor, language, and other cognitive abilities. In addition, there has been an 
increasing recognition of the need to take into account the nature of the 
child's environment in the assessment of children (Vazquez Nuttall, Romero, 
& Kalesnik, 1992). This ecological orientation is reflected in some of the in- 
struments discussed in this chapter. Typical scales, designed for infancy and 
early childhood and representing a diversity of approaches, are considered in 
this section. The Wechsler Preschool and Primary Scale of Intelligence- 
Revised also belongs in this category but was covered in chapter 8 in order to 
maintain the continuity of the Wechsler series. The Stanford-Binet, Kaufman 
Assessment Battery for Children, and Differential Ability Scales, all of which 
were discussed in chapter 8, are utilized in the assessment of preschool children 
as well, since they encompass the 2- to 6-year-old period in addition to older 
ages. 


Historical Background of Infant and Preschool Testing. One of the 
earliest systematic attempts to understand the development of normal infants 
and preschoolers was made in a series of longitudinal studies by Arnold Gesell 
and his associates at Yale (Ames, 1989). These studies, which spanned a total of 
four decades, led to the preparation of the Gesell Developmental Schedules 
which, when first published (Gesell et al., 1940), represented a pioneering 
attempt to provide a systematic, empirically based method of assessing the 
behavior development of young children. Most of the data for the schedules 
are obtained through the direct observation of the child's responses to stan- 
dard toys and other stimulus objects and are supplemented by information 
provided by the parent or principal caretaker. Over the years, the Gesell 
schedules were used widely by psychologists and pediatricians in both research 
and practice and, as revised and updated by other investigators, they are still 
used by some to supplement medical examinations, especially for the identification 
of neurological defects and organically caused behavioral abnormalities in early 


Tests for Special Populations 251 


life.2 Although the schedules have been superseded by newer, more psycho- 
metrically sophisticated instruments in most clinical settings, the items and 
procedures pioneered by Gesell and his associates have been incorporated into 
most other developmental scales designed for the infant level. 

The decades of the 1960s to the 1990s witnessed an upsurge of interest in tests 
for infants and preschool children. One early contributing factor for this in- 
creased interest was the rapid expansion of educational programs for mentally re- 
tarded children; another was the widespread development of preschool programs 
of compensatory education for culturally disadvantaged children. More recently, 
there have been a series of legislative mandates aimed at the early identification 
and remediation of all types of physical and mental disabilities in both preschool- 
ers and infants. Several of these laws (e.g:, P-L. 99-457 ) are amendments or ex- 
tensions of the Education for All Handicapped Children Act (PL. 94-142), 
which is discussed in more detail later in this chapter. At any rate, in order to 
meet these pressing practical needs, new tests and publications have appeared at a 
very fast pace, and considerable research has been conducted on innovative 
approaches to assessment. d 


Standardized Tests of Early Childhood Development 


Bayley Scales of Infant Development. An especially well-constructed test 
for the earliest age levels is the Bayley Scales of Infant Development, illustrated 
in Figure 9-1 and now available in its second edition (Bayley-II—Bayley, 1993). 
Incorporating some items from the Gesell schedules and other infant and pre- 
school tests, these scales, originally published in 1969, represent the end product 
of many years of research by Bayley and her co-workers, including the longitudi- 
nal investigations of the Berkeley Growth Study. 

The Bayley-II scales provide three complementary tools for assessing the de- 
velopmental status of children between the ages of 1 month and 3% years: the 
Mental Scale, the Motor Scale, and the Behavior Rating Scale. The Mental Scale 
samples such functions as sensory and perceptual acuities, memory, learning, 
problem solving, vocalization, the beginnings of verbal communication, and rudi- 
mentary abstract thinking. The Motor Scale provides measures of gross motor 
abilities, such as sitting, standing, walking, and stair climbing, as well as manipu- 
latory skills of hands and fingers; items that assess sensory and perceptual-motor 
integration are also included. At the infant level, locomotor and manipulatory 


For the most recent manual for the revised version of the original Gesell schedules, see Knoblo ch Stevens, and 
Malone (1980). Thete are several other tests that have used the name of Gesell in their title, but none of them 
cover the period of infancy. (See TIPIV fora list of those that are currently available and the 9th MMY for re- 
Ee of some of them), E Kelley and Surbeck (1991). Fi 
A bri T 3 à be found in M. E Kelley uri GA 
rief but informative history of preschool assessment dg i en (1994), Bracken (1991b), Culbertson and 


other relevant material on infant and preschool testing see 
Willis (1993), EE (1993), e R Reynolds and Kamphaus (19902), and Vazquez Nutall, Romero, and 


Kalesnik (1992); 
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Item 8. Lifts Head — Dorsal Suspension. 


Item 14. Adjusts Head to Ventral Suspension. 


Item 33. Pulls to Sitting Position. 


Figure 9-1. Infant Testing as Illustrated in the Manual for the Bayley Scales of 
Infant Development—Second Edition. 


(Adapted from Bayley, 1993, pp. 143, 145, and 150. Copyright © 1993 by The Psychological Corporation. 
Reproduced by permission.) 


development plays an important part in the child’s interactions with the environ- 
ment and hence in the development of her or his mental processes. The Behavior 
Rating Scale is designed to assess various aspects of personality development, 
such as emotional and social behavior, attention span and arousal, persistence, 
and goal directedness. It has a 5-point scoring system for each item and descrip- 
tors specific to the behavior being rated. The Behavior Rating Scale, which is 
completed by the examiner after the other two parts have been administered, is 
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based on information obtained from the child's caregiver as well as on the exam- 
iner's own impressions. 

In the technical quality of their test construction procedures, the Bayley scales 
are outstanding among tests for the infant level. Norms were established on 1,700 
children, including 50 females and 50 males in each of 17 age groups between the 
ages of 1 and 42 months. The standardization sample was chosen so as to be rep- 
resentative of the U.S. population in terms of race/ethnicity, major geographic re- 
gion, and parental education. Only normal children, defined as those born at 36 
to 42 weeks of gestation, who were without significant medical complications and 
without any history of treatment for mental, physical, or behavioral problems, 
were considered for the sample. The Mental and Motor scales yield separate de- 
velopmental indices, expressed as normalized standard scores with a mean of 100 
and an SD of 15. These indices are found within the child’s own age group, classi- 
fied in 1-month intervals up to 36 months and in 3-month intervals thereafter. 
The Behavior Rating Scale yields percentile rank scores that are, in turn, catego- 
rized as “Non-Optimal,” “Questionable,” or “Within Normal Limits.” In a recent 
evaluation of several scales for the ‘assessment of preschool ‘children, the 
Bayley—II was one of two tests that met the standards of technical adequacy across 
most criteria^ (D. P. Flanagan & Alfonso, 1995). 

Bayley observed that her scales, like all infant tests, should be used principally 
to assess current developmental status rather than to predict subsequent ability 
levels. Development of abilities at these early ages is susceptible to so many inter- 
vening influences as to render long-term predictions of little value, as discussed in 
chapter 12.5 

Since the publication of the original Bayley scales, the focus of infant assess- 
ment has changed from the evaluation of normal infants to the appraisal of chil- 
dren who are at risk for developmental delays or who are actually delayed already. 
Although more research is needed on the clinical utility of the Bayley-II, the 
scales as now constituted already include items selected for maximum discrimina- 
tion between normal and clinical samples and offer some data on special clinical 
populations; thus, they should prove helpful in the early detection of sensory and 
neurological defects, emotional disturbances, and environmental deficits. Fur- 
thermore, Aylward (1995) has prepared the Bayley Infant Neurodevelopmental 
Screener (BINS), a measure designed to assess infants from 3 to 24 months of age 
quickly, using combinations of between 11 and 13 items from the Bayley-II and 


other neurological tests. 


McCarthy Scales of ‘Children’s Abilities. At the preschool level, a well- 
constructed instrument is the McCarthy Scales of Children’s Abilities (MSCA— 
McCarthy, 1972), suitable for children between the ages of 2% and 8% years. It 


*The other one was the Woodcock-Johnson Psycho-Educational Battery-Revised: Tests of Cognitive Ability 
(Woodcock & Johnson, 1989, 1990). í ; 
5For a discussion of the uses and limitations of infant intelligence tests, see Goodman (1990). A series of articles 
on the psychometric properties of the original Bayley scales appears in Rovee-Collier and Lipsitt (1992). 
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consists. of. 18 tests, giving the examiner multiple opportunities to observe the 
child's approach to a variety of problems and stimuli. The tests are grouped into 
six overlapping scales: Verbal, Perceptual-Performance, Quantitative, General 
Cognitive, Memory, and Motor. The General Cognitive score, based on 15 of the 
18 tests in the battery, comes closest to the traditional global measure of intellec- 
tual development. This General Cognitive Index (GCI) is a normalized standard 
score, reported in the same units as traditional IQs (with a mean of 100 and an 
SD of 16) and found within each 3-month age group. In the development of the 
MSCA, the term IQ was deliberately avoided because of its many misleading 
connotations. The GCI is described as an index of the child’s functioning at the 
time of testing, with no implications of immutability or etiology. Scores on the 5 
additional scales are based on the same age groups and have a mean of 50 and an 
SD of 10. 

Over two decades of extensive research with the McCarthy scales have accu- 
mulated since this instrument, designed to be uniquely suited to the cognitive as- 
sessment of young children, was published. Of particular use are the many studies 
conducted with children of ethnic minorities and detailed by Valencia (1990), as 
well as the wealth of validity data summarized by him and by Bracken (1991a). 
With regard to the clinical use of the McCarthy scales, the guide prepared by 
Kaufman and Kaufman (1977) is still an indispensable aid. The prevailing view 
among the many reviewers is that despite some weaknesses, these scales are both 
effective and useful. The psychometric, properties of the scales are adequate for 
the most part, especially for the middle range of their target population. 


Piagetian Scales 


Although applicable well beyond the preschool level, the scales modeled on the 
developmental theories of Jean. Piaget have thus far found their major applica- 
tions in early childhood. All such scales are in an experimental form; few are 
commercially available. Most have been developed for use in the authors’ own re- 
search programs, although some of these scales are available to other research 
workers. The major contribution of Piagetian scales to the psychological testing 
of children consists in their ability to provide both a theoretical framework that 
focuses on developmental sequences in thinking processes and an assessment pro- 
cedure characterized by flexibility and qualitative interpretation. 

Some of the features of Piagetian scales, with special reference to normative in- 
terpretation of performance, were discussed in chapter.3. Basically, Piagetian 
scales are ordinal in the sense that they presuppose a uniform sequence of devel- 
opment through successive stages. These stages, spanning the period from infancy 
to adolescence and beyond, are designated as the sensorimotor, the preopera- 
tional, the concrete operational, and the formal operational stages. The scales 
also follow a "criterion-referenced" approach, insofar as they provide qualitative 
descriptions of what the child is actually able to do. Piagetian tasks focus on the 
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long-term development of specific concepts or cognitive schemata? rather than 
on broad traits. With regard to administration, the major object of Piagetian 
scales is to elicit the child's explanation for an observed event and the reasons 
that underlie her or his explanation. Scoring is characteristically based on the 
quality of the responses to a relatively small number of problem situations pre- 
sented to the child rather than on the number or difficulty of successfully com- 
pleted items. For this purpose, the child’s misconceptions revealed by the 
incorrect responses are of primary interest. The examiner concentrates more on 
the process of problem solving than on the product. 

Because of its highly individualized procedures, Piagetian testing is well suited 
for clinical work. It has also attracted the attention of educators because it per- 
mits the integration of testing and teaching. Its most frequent use, however, is still 
in research on developmental psychology. The tests themselves can be classified 
into two categories: (1) ordinal scales for infancy; and (2) tasks designed to assess 
the attainment of the preoperational, concrete operational, and formal opera- 
tional stages. Several examples of each exist and their use in a variety of research 
contexts has been recently reviewed by D. Sexton, Kelley, and Surbeck (1990). 
One test of each type; selected partly because of availability, is described below. 

The Ordinal Scales of Psychological Development were prepared by Uigiris 
and Hunt (1975) and are also known as the Infant Psychological Development 
Scales. They are designed to assess the acquisition of cognitive competencies be- 
tween the ages of 2 weeks and 2 years: These ages encompass approximately what 
Piaget characterized as the sensorimotor period and within which he recognized 
six levels. In order to increase the sensitivity of their instruments, U£giris and 
Hunt classified the responses into more than six levels, the number varying from 
1to 14 in the different scales. The series includes six scales, designated as follows: 


1. Object Permanence—the child's emerging notion of independently existing ob- 
jects is indicated by the visual following of an object and by her or his searching 
for an object after it is hidden with increasing degrees of concealment. 

2. ` Development of Means for achieving desired environmental ends—the child 
uses her or his own hands and other means, such as strings, stick, support, 
etc., in reaching for objects. 

3. Imitation —including both gestural and vocal imitation. 

4. Operational Causality—the child recognizes and adapts to objective causality, as 
shown by responses ranging from visually observing her or his own hands to 
eliciting desired behavior from a human agent and activating a mechanical toy. 

5... Object Relations in Space—the child coordinates schemata of looking and lis- 
tening to localize objects in space and understands such relations as con- 
tainer, equilibrium, gravity. 


Schemata” is the plural of “schema,” a term commonly encountered in Piagetian writings and signifying essen- 
tially a framework into which the individual fits incoming sensory data. 
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6... Development of Schemata for relating to objects—the child responds to objects 
by looking, feeling, manipulating, dropping, throwing, etc., and. by socially 
instigated schemata appropriate to particular objects (e.g; "driving" a toy car, 
building with blocks, wearing beads, naming objects). 


No norms are provided, but the authors collected data on the psychometric 
properties of their scales by administering them to 84 infants who were the chil- 
dren of graduate students and of staff members at the University of Illinois. Both 
observer agreement and test-retest agreement after a 48-hour interval are re- 
ported. In general, the tests appear quite satisfactory in both respects. Indices of 
ordinality, computed for each scale from the scores of the same 84 children, are 
also reported as highly satisfactory.’ 

Although the Ordinal Scales developed by Uigiris and Hunt were meant to be 
only provisional, they have been used extensively for research purposes.? The 
scales were originally designed to measure the effects of specific environmental 
conditions on the rate and course of development in infants. Studies of infants 
reared under different conditions and of infants participating in intervention pro- 
grams have thus far indicated significant effects of these environmental variables 
on the mean age at which children attain different steps in the developmental 
scales. These and other studies describing the use of the Ordinal Scales for map- 
ping cognitive development in both normal and exceptional infants are reviewed 
in a book edited by the authors of the scales (Uxgiris & Hunt, 1987). The se- 
quence of acquisitions traced by the scales pertain primarily to the infant's inter- 
actions with inanimate objects, interactions which, in turn, are seen as precursors 
to the development of communicative behavior and other adaptive skills (Dunst 
& Gallagher, 1983; Kahn, 1987). 

The second example of a Piagetian instrument to be discussed is the Concept 
Assessment Kit—Conservation (CAK), a test distributed by a regular test pub- 
lishing company on the same basis as other psychological tests. Designed for chil- 
dren aged from 4 to 7 years, it provides a measure of one of the best-known 
Piagetian concepts. Conservation refers to the child’s realization that such prop- 
erties of objects as weight, volume, and number remain unchanged when the ob- 
jects undergo transformations in shape, position, form, or other attributes. The 
authors (Goldschmid & Bentler, 1968b) focused on conservation as an indicator 
of the child’s transition from the preoperational to the concrete operational stage 
of thinking, which Piaget places roughly at the age of 7 or 8 years. 

Throughout the test, the procedure is essentially the same. The child is shown 
two identical objects; then the examiner makes certain transformations in one of 


7Procedures for the measurement of ordinality and the application of scalogram analysis to Piagetian scales are 
controversial, a fact that should be borne in mind in interpreting any reported indices of ordinality (see 
E. H. Hooper, 1973; A. C. Rosenthal, 1985). 
9The potential value of the scales in clinical assessment is also widely recognized; one significant step toward re- 
alizing this potential was the publication of a manual and scoring forms specifically designed for the clinical and 
educational use of the scales (Dunst, 1980). 
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them and interrogates the child about their similarity or difference. After answer- 
ing, the child is asked to explain her or his answer. In each item, one point is 
scored for the correct judgment of equivalence and one point for an acceptable 
explanation. For example, the examiner begins with two standard glasses contain- 
ing equal amounts of water (continuous quantity) or grains of corn (discontinu- 
ous quantity) and pours the contents into a flat dish or into several small glasses. 
In another task, the examiner shows the child two equal balls of Playdoh and 
then flattens one into a pancake and asks whether the ball is as heavy as the pan- 
cake. 

Three forms of the test are available. Forms A and B are parallel, each provid- 
ing six conservation tasks: Two-Dimensional Space, Number, Substance, Contin- 
uous Quantity, Discontinuous Quantity, and, Weight. The correlation between 
the scores on these two forms is .95. Form C includes two different tasks—Area 
and Length—and correlates .76 and .74 with Forms A and B, respectively. 

Norms were established on a standardization sample of 560 boys and girls be- 
tween the ages of 4 and 8, obtained from schools, day-care centers, and Head 
Start centers in the area of Los Angeles, California, These norms must be re- 
garded as tentative in view of the small number of cases in each age group and be- 
cause of the limitations in representativeness of the sample. Mean scores for each 
age group show a systematic rise with age, with a sharp rise between 6 and 8 years, 
as anticipated from Piagetian theory. 

The authors carried out various statistical analyses to assess: scorer reliability; 
Kuder-Richardson, parallel-form, and retest reliability; scalability, or ordinality; 
and factorial composition (see also Goldschmid & Bentler, 1968a). Although 
based on rather small samples, the results indicate generally satisfactory reliability 
and good evidence of ordinality and of the presence of a large common factor of 
conservation throughout the tasks. 

Comparative studies in seven countries suggest that the test is applicable in 
widely diverse cultures, yielding high reliabilities and showing approximately sim- 
ilar age trends (Goldschmid et al., 1973). Differences among cultures and subcul- 
tures, however, have been found in the mean ages at which, concepts are 
acquired—i.e., the age curves may be displaced horizontally by one or two years 
(see also Figurelli & Keller, 1972; Wasik & Wasik, 1971). Training in conservation 
tasks has been found to improve scores significantly (see also Goldschmid, 1968; 
B. J. Zimmerman & Rosenthal, 1974a, 1974b). The manual cites suggestive data 
about the construct validity of the test, as does a recent study by E A. Campbell 


and Ramey (1990). 


Evaluation of the Piagetian Approach. There is still controversy, on both 
theoretical and empirical grounds, regarding the Piagetian approach to cognitive 
development (see, eg» Inhelder, de Caprona, & Cornu-Wells, 1987; Liben, 
1983; Sugarman, 1987). There are questions about the implications of training 
effects and about whether or not cross-cultural differences affect the inter- 
pretation of Piagetian developmental stages. A major obstacle encountered in 
identifying stages through ordinal scales is what Piagetian researchers call 
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“décalage,”® or inconsistencies in the anticipated sequences. There is a growing 
body of data that casts doubt on the implied continuities and regularities of 
intellectual development. Too often the stage corresponding to a given indi- 
vidual's performance varies with the task, not only when different processes are 
required for its solution, but also when the same process is applied to different 
contents (Dasen, 1977; Goodnow, 1976; Horn, 1976; J. McV. Hunt, 1976). 

It should also be noted that Piagetian scales have been found to correlate 
substantially with standardized intelligence tests (Gottfried & Brody, 1975; 
Kaufman, 1971; M. E. Sexton, 1987) and to correlate about as much with school 
achievement of first-grade children as did a group intelligence test (Kaufman & 
Kaufman, 1972). Such overlap has been strongly corroborated by independent in- 
vestigators with different instruments (Humphreys, Rich, & Davey, 1985). What 
these findings suggest is that despite pronounced differences in methodology, Pi- 
agetian scales, standardized intelligence tests, and school achievement measures 
have much in common. In addition, they each contribute unique and valuable el- 
ements to the overall assessment of children. The Piagetian scales are more diffi- 
cult to administer and much more time consuming but, especially when 
integrated with norm- and domain-referenced measures, they yield a much richer 
picture of what the child can do and how he or she does it (D. Sexton et al., 1990). 

Current research on the mental activities of young children is progressing 
apace. The empirical findings both revise and expand the early Piagetian con- 
cepts (see, eg. Butterworth, Harris, Leslie, & Wellman, 1991; Whiten, 1991). In 
fact, there now exist a number of innovative approaches, collectively labelled as 
“neo-Piagetian,” that address issues regarding cognitive development from a per- 
spective influenced to different degrees by Piaget's theory and by the point of view 
of information processing (Beilin & Pufall, 1992; Demetriou, 1988). In the realm 
of assessment, some neo-Piagetian investigators are combining various dynamic 
approaches and using mediated learning in a formalized manner to try to evaluate 
mental capacity with minimal reliance on an individual’s previous knowledge 
base (Pascual-Leone & Ijaz, 1991). These techniques, which are still quite exper- 
imental, aim to be applicable to children as young as two or three years old as well 
as to individuals from various cultural, social, and linguistic backgrounds. 


Current Trends in Infant and Early Childhood Assessment 


Historically, the validity of intelligence tests has been linked primarily to the cri- 
teria of age differentiation and correlations with academic performance. For in- 
fants, adequate progress has been gauged almost exclusively by comparing their 
results to same-age norms on the wide range of tasks included in developmental 
scales such as Bayley’s. Recent societal efforts at early identification and remedia- 
tion of deficits, however, require that tools which are designed to assess cognitive 
functioning in infants have predictive power. Thus, in spite of the difficulties 


Literally, an “unwedging” of the theoretically expected response pattern. 
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posed by trying to track the developmental changes in intellectual competence at 
various age levels, there have been renewed attempts to generate instruments and 
procedures that will prove to be of sufficient predictive value for practical use. 

One of the more intriguing of the new approaches consists of measures of infor- 
mation processing skills, such as the Fagan Test of Infant Intelligence (Fagan, 
1992; Fagan & Detterman, 1992). This approach is based on the well-established 
finding of infants’ preference for novelty, which, in turn, makes possible the study 
of their ability to abstract and retain information. The Fagan test, designed to dif- 
ferentiate normal children from those with cognitive deficits, assesses selective vi- 
sual attention to novelty in infants from 3 to 12 months old. The stimuli are 
pictures of faces, and the "score" is based on the amount of time devoted to novel, 
as opposed to familiar, pictures. Figure 9-2 shows the portable stage version of this 
instrument which has been found to predict later performance on intelligence 
tests as well as or better than standard measures of infant intelligence. Correla- 
tions between Fagan test scores and IQ at the age of 3 years range from the mid 
40s to approximately .60. The Fagan test is still evolving and has been criticized 
on a number of counts (see, e.g, Benasich & Bejar, 1992; Goodman, 1990). Un- 
doubtedly, more data need to be accumulated on its clinical utility in predicting 
cognitive deficits, including mental retardation. Nevertheless, the approach it 
represents has a solid empirical basis and is quite consistent with the findings re- 
garding the nature of infant intelligence described in chapterl2. —- 

There is also increasing recognition of the fact that, in order for interventions 
to be effective, the assessment of young children has to be comprehensive as well 
as accurate and valid. Reliance on a single global score, whether it is called an IQ 
or a developmental index, is inadequate for most practical purposes. Such scores 
may serve to classify children into categories, but they do not inform their users 
about the strengths and weaknesses of the child. Individualized assessment of in- 
fants and children, as well as of older persons, needs to include multiple methods 
and multiple sources of information across pertinent domains, such as language, 
motor, and social skills. In response to these requirements, new systems for assess- 
ment are being developed, involving procedures and professionals from various 
disciplines. 

The Infant-Toddler Developmental Assessment (IDA), which is based on the 
work done by a multidisciplinary group of early childhood specialists (Provence, 
Erikson, Vater, & Palmeri, 1995a, 1995b, 1995c), exemplifies this trend. The IDA 
is essentially a framework that guides the team process of identifying children, 
from birth to 3 years, who are at tisk of developmental delays. Materials provided 
with the IDA facilitate the gathering, recording, interpretation, and synthesis of 


data from referral through to the development of a plan. The procedures include 


parental involvement through each phase, a health review, and a developmental 
rents and. other care- 


assessment based on observation and interviews with. pe 
givers. The portion of the IDA that most resembles traditional measures of infant 
development is the Provence Birth to Three Developmental Profile. Although 


the profile uses standard administration and scoring procedures in assessing eight 


developmental domains and sequences, its scores are age-referenced in terms of 
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Figure 9-2. Portable Stage Version of the Fagan Test of Infant Intelligence. 


(From Fagan & Detterman, 1992, p. 189. Copyright O 1992 by Ablex Publishing. Reproduced by permission.) 
duum elm E eg 


well-documented developmental milestones rather than through standard scores, 
percentile tanks, or other within-group comparisons. In this regard Sally 
Provence, the pediatrician who authored the profile, followed the tradition estab- 
lished by Gesell in his developmental schedules. The effectiveness of the IDA, 
like that of any other tool, is obviously linked to the training and experience of 
the professionals who use it and needs to be evaluated further through both re- 
search and clinical use. Nevertheless, it and other systems like it have been de- 
signed to meet the criticisms levelled against the practice of excessive reliance on 
intelligence tests (see, e.g., Goodman, 1990) and, if properly implemented, could 
prove to be of great practical value. It should be noted that use of the IDA does 
not preclude the administration of traditional measures of cognitive functioning 
or any other instruments for those purposes, such as gauging a child’s relative 
standing among her or his age peers, where they might be needed. 

Besides the move toward more comprehensive and integrated assessment, two 
other trends that are having significant impact on the testing of young children, 
and in the testing of most of the other special populations discussed in this chap- 
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ter, should be mentioned. The first is the influence of the ecological perspective 
in the assessment process, a perspective that calls for taking into account the var- 
ious aspects of the child's environment.!© The second concerns an increased em- 
phasis on the link between assessment and intervention, which has long been 
recognized in clinical work, where diagnosis is inextricably tied to treatment. The 
need to produce prescriptive guidelines for teachers, based on the individual’s 
unique profile of abilities and disabilities, is now seen as essential in the context of 
early intervention and education as well (Bagnato & Neisworth, 1991; Witt, El- 
liott, Gresham, & Kramer, 1988). 


COMPREHENSIVE ASSESSMENT OF MENTALLY 
RETARDED PERSONS 


The testing of children with either mental or physical disabilities underwent a 
conspicuous spurt of growth in the United States following the passage of the Ed- 
ucation for All Handicapped Children Act of 1975 (P.L. 94142)—now amended 
to be referred to as the Individuals with Disabilities Education Act (IDEA)—and 
related statutes.!! The implementation of this legislation requires four basic pro- 
cedures: (1) all children with disabilities must be identified through preliminary 
screening instruments; (2) the children thus identified are to be evaluated by a 
team of specialists to determine each child’s educational needs; (3) the school 
must develop an individualized educational program to meet these needs; and (4) 
each child is to be reevaluated periodically in the course of the program. Tests 
suitable for use in educational programs that meet the requirements of the law are 
discussed in several portions of this book, including chapters 8 and 17, as well as 
in this and the two adjoining sections of the present chapter (also see Jacobson & 
Mulick, 1996). 

In its manual on definition and classification, the American Association on 
Mental Retardation (AAMR, 1992) states that “mental retardation refers to sub- 
stantial limitations in present functioning. It is characterized by significantly 
subaverage intellectual functioning, existing concurrently with related limitations 
in two or more of the following applicable adaptive skill areas: communication, 
self-care, home living, social skills, community use, self-direction, health and 
safety, functional academics, leisure, and work" (p. 1). The manual also stipulates 
that the condition is manifested before the age of 18. This definition does not dif- 
fer substantially from the previous one (Grossman, 1983, p. 11). However, the as- 
sociated system of classification has changed dramatically to accommodate the 


view that mental retardation is not 4 trait but a disability that stems fron 


Eege i 
10The assessment of environments is discussed in a later section. of this chapter. 
"IThe major ones of these are PL. 99-457 and PL. 101-476, passed in 1986 and 1990, respectively. For de 
sions of the implications of federal legislation and court cases on the testing and assessment of children, : 
Ayers, Day, and Rotatori (1990), DeMers, Fiorello, and Langer (1992), M. P. Kelly and Melton (1993), and E 
tler (1988, pp. 767-784). 
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the interaction between the limitations of an individual and the demands of the 
environment. Whereas the previous system specified levels of mental retardation 
from mild to profound, based on intelligence test performance, the current one 
classifies the intensities of supports needed by the individual across four dimen- 
sions (intellectual functioning and adaptive skills, psychological/emotional con- 
siderations, physical health/etiology considerations, and environmental 
considerations). The definition of the intensities of supports, along with exam- 
ples, are listed in Table 9-1.!? 

The current definition reaffirms the notion that intellectual limitation is a nec- 
essary but not a sufficient condition for mental retardation. In order for the latter 
to exist, intellectual limitation rnust have an impact on adaptive or coping skills. 
The threshold level of intelligence test performance used to demarcate mental re- 
tardation is a score of 70 to 75 on a scale with a mean of 100 and an SD of 15, or 
approximately 2 or more standard deviations below the mean; this allows for error 
of measurement and accentuates the fact that there are no sharp dividing lines 
between the “mentally retarded" and the “normal.” The elimination of the levels 


Definition and Examples of Intensities of Supports 


an needed basis.” Characterized by episodic nature, person not always needing 
x short-term supports needed during life-span transitions (e.g., job loss or an 
. acute medical crisis). Intermittent supports may be high or low intensity when provided. 


Limited ` | 
An intensity of supports characterized by consistency over time, time-limited but not of an in- 


.. termittent nature, may require jewer staff members and less cost than more intense levels of 
t (e.g, time-limited employment training or transitional supports during the school-to- 


Supports characterized by regular jnvolvement (e.g., daily) in at least some environments 
(such as work or home) and not time-limited (e.g., long-term support and long-term home 
living support). . SH 


| 
Pervasive 0 } 
Supports characterized by their d onstancy, high intensity; provided across environments; po- 
- tential life-sustaining nature.. no supports typically involve more staff members and in- 
trusiveness than do extensive ot time-limited supports. 


(From AAMR, 1992, p. 26. Copyright ( 1992 by the American Association on Mental Retardation. Repro- 
duced by permission.) 


"The American Psychiatric Association's current Diagnostic and Statistical Manual of Mental Disorders 
(DSM-IV—1994) still specifies four degrees of severity of mental retardation based on levels of IQ, namely, 
Mild, Moderate, Severe, and Profound. 
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of retardation and the increased emphasis on adaptive skills and environmental 
considerations are aimed at focusing on the unique pattern of strengths and weak- 
nesses present in an individual and on the possibility of growth. The AAMR’s re- 
visions have proved to be somewhat controversial. Some critics have charged 
that the new specifications are imprecise, cannot be assessed reliably, and will in- 
crease the proportion of the population eligible for special services in the schools 
(Gresham, MacMillan, & Siperstein, 1995; MacMillan, Gresham, & Siperstein, 
1993; Matson, 1995). Others contend that these objections lack empirical 
grounding (Reiss, 1994). In any case, both the manner in which evaluations are 
conducted as well as the results of the evaluations themselves are likely to change 
as a result of the adoption of different standards; however, it is probably too soon 
to assess the full effects of the revisions. 

Besides individual tests of intelligence, such as those described in chapter 8, as- 
sessment programs for mentally retarded persons typically include measures of 
adaptive behavior in everyday life situations.!? The prototype of scales designed to 
assess adaptive behavior is the Vineland Social Maturity Scale, developed in the 
1930s by the director of the Vineland Training School, Edgar ‘Doll (1935/1965). 
As a result of his observations of the differences among mentally retarded patients, 
Doll created a standardized record form designed to assess an individual's develop- 
mental level both in looking after her or his practical needs and in taking responsi- 
bility in daily living. Its latest revision, the Vineland Adaptive Behavior Scales 
(VABS—P. L. Harrison, 1985; Sparrow, Balla, & Cicchetti, 1984a, 1984b), is 
available in three versions, which may be used independently or in combination. 
Two of the scales are Interview Editions, whereby information is obtained through 
a semistructured interview with a parent or other caregiver. One of these versions 
is the 297-item Survey Form, which is most similar to the earlier Vineland. The 
other is the 577-item Expanded Form, which also provides a systematic basis for 
preparing individualized educational or treatment programs. Both versions are ap- 
plicable from birth through 18 years and to low-functioning adults. The third ver- 
sion is the Classroom Edition, comprising @ 244-item questionnaire to be filled out 
by a classroom teacher; it covers ages 3 through 12 years. The correlations between 
the Classroom Edition and the Survey Form scores range from 31 to .54, which in- 


‘All versions of the Vineland focus on what the individual usually and habitu- 
ally does, not on what he or she can do. The items are classified under four major 
adaptive domains; these ate shown in Table 9-2, together with their subdomains 
and brief descriptions of the behavior covered. Both Interview Editions also in- 
clude an optional set of 32 items dealing with maladaptive or undesirable behav- 
iors that may. interfere with the individual's functioning. All versions provide 
well-designed forms for reporting results to parents. 

The two Interview Editions were standardized on a representative. national 
sample of 3,000 individuals, who ranged in age from birth to 16) years 1 | months 


331t should be noted that not all of the major individual intelligence scales work equally well when applied to 


mentally retarded persons (see, €-8+ Spruill, 1991). 
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Table 9-2 


Content of the Vineland Adaptive Behavior Scales 


Domains and Subdomains Description 


What the individual understands 
What the individual says 
What the individual reads and writes 


How the individual eats, dresses, 

and practices personal hygiene 
What household tasks the individual performs 
How the individual uses time, money, 

the telephone, and job skills 


How the individual interacts with others 

How the individual plays and uses leisure time 

How the individual demonstrates responsibility 
and sensitivity to others 


How the individual uses arms and legs for 
movement and coordination ; 

How the individial uses hands and 
fingers to manipulate objects 


Composite of the four domains listed above 


Undesirable behaviors that may 
interfere with adaptive functioning 


* Included for optional use in Survey Form and Expanded Form, but not in Classroom Edition. 


(Adapted from Sparrow, Balla, & Cicchetti, 1984a, p. 3. Copyright © 1984, 1985 by American Guidance Ser- 
vice, Inc. Reproduced with permission of the publisher. All rights reserved.) 


and who were stratified according to the 1980 U.S. Census on the basis of sex, 
ethnic group, community size, region, and parents’ educational level. Supplemen- 
tary norms were also established on special groups, including residential and non- 
residential samples of mentally retarded adults and residential samples of 
emotionally disturbed, visually impaired, and hearing-impaired children. The 
Classroom Edition was standardized on nearly 3,000 children who ranged in age 
from 3 years to 12 years 11 months; who were drawn from schools in 38 states; 
and who were stratified on the same basis used for the other editions. 

Standard scores with mean of 100 and SD of 15 are provided for the four do- 
main scores and for the Adaptive Behavior Composite on all three forms. Error 
bands (based on SEM) are given for five different confidence levels (from 68% to 
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99%). Also available for the same summary scores are percentile ranks, stanines, 
age equivalents, and adaptive levels (i.e., qualitative descriptive categories). For 
subdomain scores, the results can be expressed as adaptive levels and as age equiv- 
alents; for maladaptive behavior, results are reported as qualitative (maladaptive) 
levels only. The supplementary norms can be used to find percentile ranks and 
adaptive levels within each appropriate special group. Software is available for 
converting raw scores to derived scores and for profile analysis. 

For all editions, the median internal consistency reliability coefficients for the 
domain and composite scores are mostly in the high .90s. Understandably, the re- 
liabilities run lower for the subdomains, and they vary widely with age level and 
content area. The median subdomain reliabilities, however, are mostly in the .70s 
and .80s. Data on retest and interrater reliabilities suggest good stability over 
short time intervals and satisfactory agreement between two different interview- 
ers administering the scale to the same respondents. 

Several types of data summarized in the manuals for the three forms contribute 
to construct validation. To some extent, validity was built into the instrument 
through the initial formulation of adaptive behavior constructs that guided the 
preparation and selection of items. Empirical validation data are provided in 
analyses of the standardization samples, as well as in studies by independent in- 
vestigators. They include data on developmental trends in performance domains 
and subdomains; factor analyses of domain and subdomain scores; comparisons of 
score profiles obtained from samples of mentally retarded and disabled individuals 
included in the supplementary norms; and correlations with other adaptive be- 
havior inventories and with such ability tests as the WISC-R; K-ABC, and 
Peabody Picture Vocabulary Test. 

In general, the procedures employed in developing and: evaluating the 
Vineland were of high technical quality and are fully and clearly described in the 
manuals. They reflect advances that occurred since the publication of the earlier 
edition. However, the practical effectiveness of the instrument depends on a thor- 
ough knowledge of its psychometric characteristics as described in the manual 
and in the literature that has accumulated on it: i 

There has been widespread interest in the use of assessment results as a basis 
for designing and choosing appropriate training programs for mentally retarded 
persons. This interest, in turn; has led to the development of an increasing num- 
ber of scales for measuring adaptive behavior. One example is the Adaptive 
Behavior Scales (ABS), which were developed by the AAMR and designed with 
the same general objectives as the Vineland. The AAMR Adaptive Behavior 
Scale—Residential and Community, Second Edition (ABS-RC:2—Nihira, 
Leland, & Lambert, 1993)—was standardized on over 4,000 adults with develop- 
mental disabilities living in community or residential settings; it has scores on 


and Brown (1990), Poth and Barnett (1988), Raggio and Massingale 
(1990), Schatz and Hamdan-Allen (1995), and Silverstein (1986). For further description and independent 
evaluations of the three forms of the Vineland, see L.A. Campbell (1985) and CR. Reynolds (1986). 

see Fox and Meyer (1990), Knoff (1992), and Sattler (1988, chap. 15): 


"See, for example, Middleton, Keene, 


15 
For surveys of many of these scales, 
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18 domains, of which 10 are concerned with coping skills and 8 with social 
behavior, including various types of maladaptive patterns. On the other hand, 
the AAMR Adaptive Behavior Scale—School, Second Edition (ABS-S:2— 
Lambert, Nihira, & Leland, 1993)—was normed on both mentally retarded and 
non-retarded students, ranging in age from 3 to 18 years. 

Another area that needs to be assessed in mentally retarded individuals is 
motor development (also examined in infant scales). The prototype of instru- 
ments for that purpose is the Oseretsky Tests of Motor Proficiency, originally pub- 
lished in Russia in 1923. Other applications of the Oseretsky tests are found in 
the testing of children with motor handicaps, minimal brain dysfunction, or 
learning disabilities, particularly in connection with the administration of indi- 
vidualized training programs. The current revision of the Oseretsky scales is the 
Bruininks-Oseretsky Test of Motor Proficiency (Bruininks,1978). Requiring 45 to 
60 minutes, the complete battery comprises 46 items grouped into 8 subtests. It 
yields 3 scores: a Gross Motor Composite measuring performance of the large 
muscles of shoulders, trunk, and legs; a Fine Motor Composite measuring perfor- 
mance of the small muscles of the fingers, hands, and forearm; and a Total Battery 
Composite. There is also a 14-item Short Form, requiring 15 to 20 minutes and 
providing a single index of general motor proficiency. Performance can be ex- 
pressed in terms of age-based standard scores, percentile ranks, and stanines. Age 
equivalents are also available for each subtest. The battery was standardized on a 
sample of 765 children between the ages of 4% and 14%, chosen so as to be repre- 
sentative of the U.S. population. Test-retest reliability of the composite scores, 
over intervals of 7 to 12 days, cluster in the .80s. Validity was investigated in sev- 
eral ways, including factor analysis of items, age differentiation, and comparisons 
between children with and without mental retardation and learning disabilities. 

One of the main challenges in the assessment of mental retardation lies in dis- 
tinguishing between this condition and developmental delays, especially during 
infancy and early childhood. Not only is cognitive assessment in this period less 
reliable than at other ages, as we have seen, but there is also the possibility that 
what manifests as impaired cognitive functioning may be the result of various 
other conditions (Hodapp, Burack, & Zigler, 1990). Chief among the factors 
that can affect a child’s level of intellectual performance and adaptive skills in a 
negative direction are sensory and motor disabilities and an adverse home envi- 
ronment. The remainder of this chapter includes discussions of issues related to 
both of these factors which may, of course, be present either singly or in combi- 
nation. 


TESTING PERSONS WITH PHYSICAL DISABILITIES 


Although the problems presented by the testing of disabled persons have received 
some attention over several decades, special efforts to meet these problems were 
stimulated by legislation enacted since the 1970s. The provision of suitable edu- 
cation for all physically disabled children is covered by the previously cited Indi- 
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viduals with Disabilities Education Act. At a much broader level, the general 
Civil Rights provisions mandated for other minorities were largely extended to 
cover physically disabled persons, first through Section 504 of the Rehabilitation 
Act of 1973 and, more recently, by the Americans with Disabilities Act of 1990 
(ADA—P. L. 101-336).!5 These laws prohibit discrimination in the areas of: (1) 
employment practices; (2) accessibility of physical facilities; (3) preschool, ele- 
mentary, and secondary education; (4) postsecondary education; and (5) health, 
welfare, and social services. The ADA strengthens earlier mandates and extends 
them to organizations in the private sector. 

The testing of physically disabled children at an early age is especially impor- 
tant in order to provide appropriate educational experiences from the outset. 
Such an approach helps to prevent the cumulative learning deficits that would 
magnify the effects of the disability on intellectual development.! ^ At any age, 
the testing of physically disabled persons presents special problems with reference 
to test administration and the proper interpretation of test performance. Thus far, 
the principal ways of handling such testing include: (1) modification of the test- 
ing medium, time limits, and content of existing tests; and (2) individualized, 
clinical assessment that combines test scores with other sources of data from bio- 
graphical history, interviewing, and the judgment of adequately informed every- 
day-life observers, such as teachers (AERA, APA, NCME, 1985, chap. 13; Bailey 
& Wolery, 1989; Barnett, 1983; Culbertson & Willis, 1993; Eyde, Nester, 
Heaton, & Nelson, 1994; Scarpati, 1991; Sherman & Robinson, 1982). 

Efforts to establish separate norms for persons with specific physical disabilities 
or to design tests specially for such groups are usually hampered by the small num- 
ber of available cases. This limitation applies especially to low incidence and mul- 
tiple disabilities and to the use of tests in contexts—such as graduate and 
professional school admissions—that involve specially selected samples. Never- 
theless, research has been continuing on the performance of persons with differ- 
ent physical disabilities; who have taken either standard or specially adapted 
versions of various tests: 

One of the most ambitious series of studies was carried out at the Educational 
Testing Service using standard and nonstandard versions of the College Board 
SAT and the GRE General Test with four classes of disabled applicants—namely, 
hearing impaired, visually impaired, learning disabled, and physically, impaired 
(Willingham et al, 1988). The psychometric characteristics investigated in- 
cluded reliability, differential item functioning, factor structure, and other indices 
of validity related ro performance levels and predictive power; test content, tim- 
ing, and accommodations were also studied. In general, the results show that the 


Fora revi f the implicati A chological testing, see Nester (1994). The psychometric and 
ora review of the implications of the AD. for psychologi "rain egene fré 


assessment issue: ed by thi: discussed in a statement ol 
bës Ggs EE which was published in the January 1993 issue of The 
Score, the division's newsletter- 
VFurther information about patterns of development in young di: 
in their assessment can be found in Wachs and Sheehan (1988). 
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procedural adaptations are comparable to standard testing in most respects, in- 
cluding the meaning of scores. However, prediction of academic performance 
from scores, or prior grades, is not as accurate for persons with disabilities as it is 
for others; and there are questions concerning the factor structure and item func- 
tioning of some of the test adaptations (R. E. Bennett, Rock, & Novatkoski, 
1989; Rock, Bennett, & Jirele, 1988; Willingham, 1988). In addition, the time 
limits in the nonstandard versions have been found to be relatively more lenient, 
a finding that supports the controversial practice of "flagging" scores obtained 
using these versions. Thus, the need to develop empirically-based comparable 
time limits has emerged as a critical one (see, e.g., Wainer, 1993a, pp. 9-10). 

There are still many unresolved psychometric and ethical issues concerning 
the testing of persons with disabilities. Although more research is needed, it is 
also necessary to recognize that some of the problems may be insoluble because 
they derive from the fact that every person presents a unique configuration of 
types and degrees of abilities, disabilities, and personal characteristics. Neverthe- 
less, for practical purposes, there is now a greater level of awareness of and 
sensitivity toward the needs of disabled persons than ever before—as well as im- 
proved guidance on how to conduct testing with them. In addition, new advances 
in' instrumentation, such as voice synthesizers that simulate speech and other 
computer-controlled electronic devices, offer the possibility of a wide range of 
promising testing innovations, including many that would be of value in this area 
(see, e.g., Educational Testing Service, 1992; Wilson, 1991). 

The following sections consider special testing problems and procedures with 
reference to three major categories of physical disability—namely, hearing, visual, 
and motor impairments. 


Hearing Impairments.!5 Owing to their general retardation in language 
development, hearing-impaired children are usually handicapped by verbal tests, 
even when the verbal content is presented visually. The earlier the onset of 
deafness in young children, the more severe is this handicap. Fortunately, recent 
advances in the assessment of auditory functioning have made it possible to 
diagnose hearing loss accurately—and start habilitation procedures—within the 
first few months of life (Shah & Boyden, 1991). 

The testing of deaf children was the primary object in the development of 
some of the earliest performance scales, such as the Pintner-Paterson Performance 
Scale and the Arthur Performance Scale. Special adaptations of the Wechsler 
scales are often employed in testing deaf persons. Most of the verbal tests can be 
administered if the oral questions are typed on cards. Various procedures for com- 
municating the instructions for the performance tests have been worked out (see, 
e.g., Sattler, 1988, 1992); in fact, the WISC-R Performance Scale has been the 
most widely used test of intelligence for hearing-impaired children in the United 


18For reviews of the issues and considerations pertinent to the assessment of hearing-i i ildre: 
earing-impaired children, see 
Bradley-Johnson and Evans (1991), Y. Mullen (1992), and Sullivan and Burley (1990). 
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States. With such modifications of standard testing procedures, however, one can- 
not assume that reliability, validity, and norms remain unchanged. Given the ex- 
tensive use of the Wechsler scales with the hearing impaired, however, there is a 
copious literature on their psychometric. properties when applied. to samples of 
hearing-impaired individuals (see, e.g., Braden, 1985; Maller & Braden, 1993; 
Sullivan & Schulte, 1992). In general, these studies indicate that there are sub- 
stantial similarities in the factor structure of the scales and in the predictive and 
construct validity of the Performance Scale’ for hearing-impaired and nonim- 
paired children. 

The tests mentioned thus far were standardized on hearing persons. Many inż 
vestigators have concluded that when levels of performance are comparable to 
those of hearing persons, as is the case for the Wechsler Performance scales, there 
is no need for separate norms for hearing-impaired individuals. At the same time, 
norms obtained on deaf children are useful in a number of situations pertaining to 
their educational development. To meet this need, some efforts have been made 
to establish special norms for existing tests, as illustrated by the standardization of 
the WISC-R Performance Scale for Deaf Children (R. J. Anderson & Sisco, 
1977). 

Ata more basic level, the Hiskey-Nebraska Test of Learning Aptitude (Hiskey, 
1966) was developed and standardized on deaf and hard-of-hearing children. This 
is an individual test suitable for ages 3 to 17. Speed was eliminated, since it is dif- 
ficult to convey the idea of speed to deaf children. An attempt was also made to 
sample a wider array of intellectual functions than those covered by most perfor- 
mance tests. Pantomime and practice exercises that communicate the instruc- 
tions, as well as intrinsically interesting items that establish rapport, are used in 
the test. Items were chosen with special reference to the limitations of deaf chil- 
dren, with final selection being based chiefly on the criterion of age differentia- 
tion. Norms were derived separately from 1,079 deaf and.1,074 hearing children; 
parallel instructions for testing both types of children are provided in the manual, 
which contains a discussion of desirable practices to be followed in testing deaf 
children. Although the norms for it are dated, the Hiskey-Nebraska has adequate 
reliability and validity evidence and is still considered one of the best tests for use 
with hearing-impaired children (Sullivan & Burley, 1990). : 

Over the past five decades the evolution of knowledge about the intellectual 
consequences of deafness has been astonishing. Much of this history is discussed 
by Braden (1994) in a comprehensive review of over 200 studies of deaf people 
involving more than 170,000 research participants. This work describes many of 
the intriguing findings in this literature, including the fact that deaf children who 
have deaf parents score above the norms for non-impaired children: on perfor- 
mance tests. Although the reasons for these findings are not yet fully understood, 
there is no question that deafness is a far more complex variable than was previ- 
ously thought. Etiology, extent, ages of onset and detection of hearing loss, as well 
as mode of communication, educational placement, parental hearing status, and 
presence of additional disabilities are all factors that interact and contribute to 
the differential cognitive performance of hearing-impaired individuals. 
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Visual Impairments.!? Testing the blind presents a very different set of 
problems from those encountered with the deaf. Oral tests can be most readily 
adapted for blind persons, while performance tests are least likely to be applicable. 
In addition to the usual oral presentation by the examiner, other suitable testing 
techniques have been utilized, such as tape recordings. Some tests, such as the 
College Board Scholastic Assessment Test (SAT), are also available in large-type 
formats or in braille. The latter technique is somewhat limited in its applicability, 
however, because of: the greater bulkiness of materials printed in braille compared 
to those provided in ink; the slower reading rate for braille; and the number of 
blind persons who are not facile braille readers. The test taker's responses may 
likewise be recorded in braille or on a keyboard. Specially prepared embossed 
answer sheets or cards are also available for use with true-false, multiple-choice, 
and other selected-response items. In many individually administered tests, of 
course, oral or gestural responses can be obtained. 

Among the earliest examples of general intelligence tests that have been adapted 
for blind persons is the Binet. The first Hayes-Binet revision for testing the blind 
was based on the 1916 Stanford-Binet. In 1942, the Interim Hayes-Binet^? was pre- 
pared from the 1937 Stanford-Binet (Hayes, 1942, 1943). The most recent adapta- 
tion—comparable to the Stanford-Binet Form L-M—is the Perkins-Binet Tests of 
Intelligence for the Blind. This instrument was standardized on and has separate 
forms for partially sighted and blind children (C. J. Davis, 1980). 

The Wechsler scales have also been adapted for blind test takers. These adap- 
tations consist essentially in using the verbal tests and omitting the performance 
tests. A few items inappropriate for the blind are replaced by alternates. In gen- 
eral, the studies of children who have poor vision or blindness suggest that these 
conditions may have a negative impact on their cognitive development, even in 
the verbal area, because of the limitations such conditions impose on the range 
and variety of their experiences. The Wechsler profiles of visually impaired chil- 
dren have shown a similar pattern across various studies; the results suggest that 
the factorial composition of the tasks is different for them than it is for normally 
sighted children. Although global IQs cannot be considered accurate measures of 
the overall cognitive functioning of visually impaired children, in the hands of 
informed examiners, the Wechsler scales can provide useful diagnostic informa- 
esee the strengths and weaknesses of these children (Groenveld & Jan, 


Very few instruments have been developed specifically for use with visually im- 
paired persons. Possibly the best known example of these is the Blind Learning 
Aptitude Test (BLAT—Newland, 1979). The BLAT is an individually adminis- 
tered test that incorporates items adapted from other tests, such as Raven's Pro- 
gressive Matrices, and other nonverbal items, and presents them in an embossed 


For reviews of issues and procedures in the assessment of visually impaired children, see Bradley-Johnson 
(1994), Fewell (1991), M. S. Moore and McLaughlin (1992), and Orlansky (1988). 

Originally designated as an interim edition because of the tentative nature of its standardization, this revision 
has come to be known by this name in the literature. 
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format. Emphasis is placed-on the learning process rather than on the products of 
past learning, which might handicap the blind child. Normative data, though 
dated, compare favorably with those usually available for instruments designed for 
special populations. Information regarding reliability and validity is scant and re- 
quires further research. Nevertheless, the BLAT can be a useful component, along 
with verbal tests, in the evaluation of blind children of elementary school age. 

A more recent example, still in the developmental stages, has derived from the 
attempt of a group of investigators in the Netherlands to provide a comprehen- 
sive measure of intelligence for children with visual impairments. The Intelli- 
gence Test for Visually Impaired Children (ITVIC—Dekker, Drenth, Zaal, & 
Koole, 1990) incorporates haptic or tactile versions of tasks such as Block Design 
into a battery that includes several nonverbal and verbal subtests.?! Additional 
research on the ITVIC, with a wider range of subjects, is needed; however, the 
initial studies show promising results.in terms of the validity of the instrument 
(Dekker, 1993; Dekker, Drenth, & Zaal, 1991; Dekker & Koole, 1992). 

As is the case with all the other special conditions discussed in this chapter, vi- 
sual impairments occur ina wide range of gradations and quite often in combina- 
tion with other problems. Thus, the decision of whether to use standard tests, 
adaptations of them, or specially designed tests for the blind depends on the ob- 
jectives of the assessment and the unique characteristics of the person in ques- 
tion. In general, test users should always remember that modifications of tests, 
such as tactile presentations of visual designs or extended time limits, cannot be 
assumed to measure the same constructs as the original versions. 


Motor Impairments.*? Although usually able to receive auditory and visual 
stimulation, persons with: orthopedic disabilities may have such severe motor 
disorders as to make oral or written responses impracticable: The attempt to 
manipulate formboards or other performance materials would likewise meet diffi- 
culties. Working against a time limit or in strange surroundings often increases 
the motor disturbance in the orthopedically impaired. Their greater susceptibility 
to fatigue makes short testing sessions necessary. 

Some of the severest motor disabilities are found among persons with cerebral 
palsy. Yet surveys of these cases have often employed common intelligence tests 
such as the Stanford-Binet. In such studies, the most severely disabled were usu- 
ally excluded as untestable. Frequently, informal adjustments in testing procedure 
were made to adapt the test to the child’s response capacities. Both of these pro- 
cedures, of course, are makeshifts. 


2t A i wea ed on blind adults during the 1950s and 
The Haptic Intelligence Scale is a similar measure developed and norm ki 
early 1960s (Shurrager & Shurrager, 1964). It consists of six performance subtests ned xi De deier 
Bellevue Intelligence Scale—namely, Digit Symbol, Object Assembly, Block Design, Objecr Completion, Pa 
tern Board, and Bead Arithmetic. ds uet 
22 ] 1 d in young children can fo 
A review of measures useful in the assessment of gross motor functions in young 
H. G. Williams (1991). c Robinson and Fieber (1988) describe a process-oriented approach to the assessment 
of young children using Piagetian sensorimotor and preoperational tasks. 
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A more satisfactory approach lies in the development of testing instruments 
suitable for even the most severely impaired individuals. A number of specially 
designed tests or adaptations of existing tests are now available for this purpose, 
although their normative and validity data are often meager. Several of the tests 
to be discussed in the next section, originally designed for use in cross-cultural 
testing, have also proved applicable to disabled persons. Adaptations of the Leiter 
International Performance Scale and the Porteus Mazes, suitable for administra- 
tion to children with cerebral palsy, have been prepared (Allen & Collins, 1955; 
Arnold, 1951). In both adapted tests, the examiner manipulates the test materials 
while the test taker responds only by appropriate head movements. Raven's Pro- 
gressive Matrices provide a useful tool for this purpose as well. Since this test is 
given with no time limit, and since the response may be indicated orally, in writ- 
ing, or by pointing or nodding, it appears to be especially appropriate for orthope- 
dically disabled individuals. Despite the flexibility and simplicity of its response 
indicator, this test—in its various forms—covers a wide range of difficulty and 
provides a fairly high test ceiling. Successful use of this test has been reported in 
studies of individuals with cerebral palsy and other motor disorders (see, e.g., 
Capitani, Sala, & Marchitti, 1994). 

Another type of test that permits the utilization of a simple pointing response 


which makes them especially applicable to persons unable to vocalize well (such 
minister and can be completed in about 15 minutes, picture vocabulary tests are 
also useful as a rapid screening device in situations where administration of a 


comprehensive individual intelligence test is not feasible. 
The Peabody Picture Vocabulary Test is typical of these instruments. Its cur- 


other. Raw scores can be converted to standard scores (M=100, SD=15), per- 
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search base that had been established on the PPVT, with which it has a median’ 
correlation of .70. A survey of over 300 studies using the PPVT yielded high cor- 
relations with other vocabulary tests, moderate correlations with tests of verbal 
intelligence and scholastic aptitude, and promising relations with performance 
on educational achievement tests. Correlations were similar in diverse popula- 
tions, including economically disadvantaged groups and samples of people with 
disabilities and mental retardation. Scores on the PPVT reflect in part the re- 
spondent's degree of cultural assimilation and exposure to Standard American 
English. 5 

Studies using the PPVT-R itself indicate that it also correlates highly with 
other measures of verbal comprehension (see, e.g., Elliott, 1990b, p. 235). A par- 
ticularly interesting study of the PPVT-R, using a structural equation model of ac- 
quisition order of words, offers substantial support for the construct validity of the 
instrument (L. T. Miller & Lee, 1993). A third edition of the Peabody Picture Vo- 
cabulary Test-the PPVT-III-along with an expressive vocabulary test with which 
it has been conormed, will be published in 1997. 

Similar procedures of test administration have been incorporated in picto- 
rial classification tests, as illustrated by the Columbia Mental Maturity Scale 
(CMMS—Burgemeister, Blum, & Lorge, 1972). Originally developed for use with 
children with cerebral palsy, this scale comprises 92 items, each consisting of a set 
of 3, 4, or 5 colored drawings printed on a large card. The examinee is required to 
identify the drawing that does not belong with the others, indicating her or his 
choice by pointing or nodding (see Fig. 9-3). The standardization sample for the 
CMMS consisted of 2,600 children between the ages of 3:6 and 9:11 and was rep- 
resentative of the U.S. population as of 1960. Split-half and retest reliabilities 
range from .84 to .91. A correlation of .67 with Stanford-Binet IQ was found in a 
gtoup of 52 preschool and first-grade children. Correlations with achievement 
test scores in first- and second-grade samples fell mostly between the high .40s 
and the low .60s. Extensive data on validity and on applicability of the CMMS to 
various groups of disabled individuals are available for the earlier and current 
forms of the test (see Tests in Print Il, Ill, and IV). However, because of its out- 
dated norms and the narrow range of abilities it assesses, the applicability of the 
CMMS is fairly limited. 


MULTICULTURAL TESTING 


The Problem. The testing of persons with highly dissimilar cultural back- 
grounds has received increasing attention since the 1950s. Tests are needed for 
maximum utilization of human resources in the newly developing nations in 
many parts of the world. The rapidly expanding educational facilities in these 
Countries require testing for admission purposes as well as for individual 
counseling. With increasing industrialization, there is a mounting demand for 
tests to aid in the job selection and placement of personnel, particularly in the 
information processing, mechanical, and professional fields. 
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Figure 9-3. Examiner Administering Columbia Mental Maturity Scale to Child. 


(From Columbia Mental Maturity Scale: Guide for administering and interpreting, Burgemeister et al 1972, p. 11, 
Copyright © 1972 by Harcourt Brace Jovanovich, Inc. Reproduced by permission.) 


widespread concern regarding the applicability of available tests to culturally disad- 
vantaged persons. It should be clearly recognized that cultural disadvantage is a rel- 
ative concept. Objectively, there is only cultural difference between any two 


pete within a culture or subculture other than that in Which he or she was reared, 
then cultural difference may become cultural disadvantage—or, indeed, advantage. 

Although concern with cross-cultural testing has been greatly stimulated by re- 
cent social and political developments, the problem was recognized at least as 
early as 1910. Some of the earliest cross-cultural tests were developed for testing 
large waves of immigrants coming to the United States during the first decades of 
the twentieth century (Knox, 1914). Others originated in basic research on the 


"Multicultural testing” is also widely referred to as cross-cultural or transcultural testing. 
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comparative abilities of relatively isolated cultural groups. These cultures often 
had had little or no contact with Western, technologically advanced societies 
within whose framework most psychological tests had been developed.”4 ` 

Traditionally, cross-cultural tests have tried to rule out one or more parameters 
along which cultures vary. A well-known example of such a parameter is language. 
If the cultural groups to be tested spoke different languages, tests were developed 
that required no language on the part of either examiner or test takers. When ed- 
ucational backgrounds differed widely and illiteracy was prevalent, reading was 
ruled out. Oral language was not eliminated from these tests because they were 
designed for persons speaking a common language. Another parameter in which 
cultures or subcultures differ is that of speed. Not only the tempo of daily life, but 
also the motivation to hurry and the value attached to rapid performance vary 
widely among national cultures, among ethnic minority groups within a single 
nation, and between urban and rural subcultures (see, e.g., Klineberg, 1928; R.R. 
Knapp, 1960; M. Womer, 1972). Accordingly, cross-cultural tests have often— 
though not always—tried to eliminate the influence of speed by allowing long 
time limits and giving no premium for faster performance. 

Still other parameters along which cultures differ pertain to test content. Sev- 
eral nonlanguage and nonreading tests, for example, call for items of information 
that are specific to certain cultures. Thus, they may require the test taker to un- 
derstand the function of such objects as violin, postage stamp, gun, pocketknife, 
telephone, piano, or mirror. Persons reared in certain cultures may lack the expe- 
riential background to respond correctly to such items. It was chiefly to control 
this type of cultural parameter that the classic “culture-free” tests were first devel- 
oped. Following a brief examination of typical tests designed to eliminate one or 
more of the above parameters, we shall turn to an analysis of alternative ap- 
proaches to cross-cultural testing. 


Typical Traditional Instruments.?? In their efforts: to construct tests 
applicable across cultures, psychometricians followed a variety of procedures, 
some of which are illustrated by the tests considered in this section. The Leiter 
International Performance Scale-Revised (Roid & Miller, 1997) is an individually 
administered performance test, originally published in 1940. It was developed 
through several years of use with different ethnic groups in Hawaii. The scale was 
subsequently applied to several African groups by Porteus and to a few other 
national groups by other investigators. A revision issued in 1948 was based on 
further testing of U.S. children, high school students, and Army recruits during 
World War Il. The 1997 edition is based on samples of over 2,000 normal and 
atypical individuals from the U.S. between the ages of 2 and 20 years. A 
distinctive feature of the Leiter, since adopted by other instruments, is the almost 
complete elimination of verbal instructions. Each test begins with a very easy task 


*4For examples of early tests, see Anastasi (1954, chap.10). ' 
"SRevieus of several of the nonverbal measures discussed in this and the preceding sections, as well as other such 
tests, can be found in Naglieri and Prewett (1990). 
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Figure 9-4, Typical Materials for Use in the Leiter International Performance 
Scale—Revised. The partially completed task that is illustrated belongs in the Sequential 


Order Test and requires the examinee to select five out of the six response cards and 
place them correctly on the tray. 


(Courtesy of the Stoelting Company.) 


placing them on a response tray, as seen in Figure 9.4. 

The Leiter scale was designed to cover a wide range of functions, similar to 
those found in verbal scales. In its current form, the coverage has been expanded 
significantly and now encompasses four domains, namely, Reasoning, Visualiza- 


chometric characteristics, For example, item response theory was used to calibrate 
difficulty levels, and Scores are no longer expressed as ratio Is. In addition, the 
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availability of representative contemporary norms and the expanded content 
should enhance the usefulness of the scale. The new manual includes information 
on various types of reliability and validity evidence. 

Raven's Progressive Matrices (RPM) was designed primarily as a measure of 
Spearman's g factor or general intelligence (J. Raven, 1983; Raven, Raven, & 
Court, 1995). In keeping with Spearman’s theoretical analysis of g, this test re- 
quires chiefly the eduction of relations among abstract items. The items consist of 
aset of matrices, or arrangements of design elements into rows and columns, from 
each of which a part has been removed. The task is to choose the missing insert 
from given alternatives. The easier items require accuracy of discrimination; the 
more difficult items involve analogies, permutations and alternations of patterns, 
and other logical relations. Two typical items from the Standard Progressive Matrices 
are illustrated in Figure 9-5. The test is usually administered with no time limits 
and can be given individually or in groups. Very simple oral instructions are required. 

The RPM is available in three forms, differing in level of difficulty. The Stan- 
dard Progressive Matrices (SPM—1996 Edition) is the form suitable for average 
individuals between the ages of 6 and 80 years. An easier form, the Coloured Pro- 
gressive Matrices (CPM—1990 Edition), is available for younger children and for 
special groups who cannot be adequately tested with the SPM for various reasons. 
Norms for the CPM are available for children from 5% to 11% and for samples of 
nonretarded persons aged 60 to 89 years and of mentally retarded adults. A third 
form, the Advanced Progressive Matrices (APM—1994 Edition) was developed 
for above-average adolescents and adults. 


A) CD | 


Figure 9-5. Typical Items from the Standard Progressive Matrices (SPM). One 
easy item (A5) and one difficult item (El) are shown. 
(Reproduced by permission of J. C. Raven Ltd.) 
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The manual for all levels of the RPM is available in sections that can be pur- 
chased separately or in any desired combination and can be assembled in a single 
binder. Section 1 gives a general overview and was last updated in 1995; the pub- 
lication dates of the other sections range from 1990 to 1996. They include the 
specific manuals for each of the three levels of the RPM. The manuals for two vo- 
cabulary tests standardized for use in combination with the RPM are also part of 
the Raven's package. The last section of the manual summarizes additional re- 
search on reliability, validity, and supplementary norms obtained in various coun- 
tries and on special populations (Court & Raven, 1995). Several research 
supplements with British standardization data and normative information gath- 
ered in North America, Ireland, and Germany, as well as an annotated bibliogra- 
phy of over 2,000 studies are also available.26 

Although there is a large accumulation of published research on the'RPM, the 
studies are scattered and heterogeneous, having been designed for a variety of dif- 
ferent purposes. The test authors recommend that potential users identify those 
studies and populations most relevant to their own interests; but they caution 
that the studies vary widely in methodology, size of. samples; and technical quality. 

In general, retest reliability in. groups of older children and adults that were 
moderately homogeneous in age ranges approximately from .70 to .90. At the 
lower score ranges, however, reliability falls considerably below these values. In- 
ternal consistency coefficients are mostly in the .80s and .90s. Correlations with 
both verbal and performance tests of intelligence range between .40 and bg 


accuracy, and other group factors also influence performance. 

A still different approach to nonverbal testing is illustrated by the Goode- 
nough Draw-a-Man Test, in which the test taker is simply instructed to “make a 
picture of a man; make the very best picture you can." This test was in use with- 
out change from its original standardization in 1926 until 1963. An extension 
and revision was published in 1963 with the title of Goodenough-Harris Drawing 
Test (D. B. Harris, 1963). In the revision, as in the original test, emphasis is 
placed on the child's accuracy of observation and on the development of concep- 
tual thinking, rather than on artistic skill. Credit is given for the irclusion of in- 
dividual body parts, clothing details, Proportion, perspective, and similar features. 
A total of 73 scorable items were selected on the basis of age differentiation, rela- 
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a ee REN S obtained on disk and in hard copy from LH Court, at an address on record with the pub- 
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In the revised scale; examinees are also asked to draw a picture of a woman and 
of themselves. The Woman scale is scored in a manner similar to the Man scale. 
The Self scale was developed as a projective test of personality, although findings 
from this application are not promising." Point scores on each scale are trans- 
muted into standard scores with a mean of 100 and an SD of 15. Figure 9-6 shows 
three illustrative drawings produced by children aged 5:8, 8:8, and 12:11, together 
with the corresponding raw scores’ and standard scores. The reliability of the 
Goodenough-Harris Drawing Test has been repeatedly investigated by a variety of 
procedures. Retest correlations, as well as:split-half and scorer reliability figures 
are adequate; the effect of art training in school on scores is apparently negligible 
(J-A. Dunn, 1967; D.B. Harris, 1963). 

Apart from the item-analysis data gathered in the development of the scales, 
information regarding the construct validity of the test is provided by correlations 
with other intelligence tests. These correlations vary widely, but the majority are 
over .50. For kindergarten children, the Draw-a-Man Test correlated higher with 


Man: Raw Score 66 
CA 12-11 


Standard Score 73 Standard Score 103 Standard Score 134 


Man: Raw Score 7 Woman: Raw Score 31 
CA 5-8 CA 8-8 


Figure 9-6, Specimen Drawings Obtained in Goodenough-Harris Drawing Test. 


(Courtesy of Dale B. Harris.) 


drawings are discussed in chapter 15. These include 


219, : 
th ve à f human fi 
«piece p egi asses both cognitive and emotional aspects. 


Elizabeth Koppitz’s work with this technique, which emcomp 
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numerical aptitude and lower with perceptual speed and accuracy than it did for 
fourth-grade children (D. B. Harris, 1963). These findings suggest that the test 
may measure somewhat different functions at different ages. Both versions have 
been employed in a large number of studies on different cultural and ethnic 
groups. Such investigations have indicated that performance on this test is more 
dependent on differences in cultural background than was originally assumed. 
Dennis (1966), for example, analyzed comparative data obtained with the test in 
40 widely different cultural groups and found that mean group scores appeared to 
be most closely related to the amount of experience with representational art 
within each culture. 

Cultural differences in experiential background were again revealed in a well- 
designed comparative investigation of children from Mexico and the U.S. with 
the Goodenough-Harris test (Laosa, Swartz, & Diaz-Guerrero, 1974). In a more 
recent large study of this test in Iran, the mean scores of 6- to 13-year-old children 
were somewhat lower than U.S. norms, but the scores showed good age differenti- 
ation and positive correlations with socioeconomic status and academic achieve- 
ment measures (Mehryar, Tashakkori, Yousefi, & Khajavi, 1987). It should be 
added that these findings with the Goodenough-Harris test are typical of results 
obtained with all tests initially designed to be "culture-free" or "culture-fair" 
(Samuda, 1975, chap. 6). 

A new version of the Draw-a-Man test, intended to update and improve the 
technical qualities of the Goodenough-Harris version, is now available under the 
title of Draw A Person: A Quantitative Scoring System (DAP—Naglieri, 1988). 
The DAP provides more recent and more detailed norms, but somewhat different 
administration procedures, as well as a revised Scoring system that attempts to be 
less ambiguous than the Goodenough-Harris Drawing Test. In addition, it in- 
cludes normative data on samples of Black and Hispanic children. In spite of the 
improvements, however, this revision has been criticized for its relatively limited 
scope and for the lack of evidence concerning the advantages of the revised scor- 
ing system (Cosden, 1992). 

A final point should be reiterated about the instruments discussed in this sec- 
tion. Although initially developed for cross-cultural testing, several of them have 
found major applications in the armamentarium of clinical and counseling psy- 
chologists, both to supplement information obtained from such instruments as 
the Stanford-Binet and Wechsler scales and as one source of input in the testing 
of persons with various disabilities. This recognition has led to the preparation of 
a new generation of such measures. One of these, already in its second edition, is 
the Test of Nonverbal Intelligence (TONL-2—L. Brown, Sherbenou, & Johnsen, 
1990), which is similar to the RPM in its content and range of applicability (for 
reviews of the se see K. R. Murphy, 1992 and Watson, 1992). Other major 
new instruments of this type are currently ui i izati ill be 
SERGE E y undergoing standardization and will be 
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Approaches to Cross-Cultural Testing. Theoretically, we can identify 
three approaches to the development of tests for persons reared in different 
cultures or subcultures, although in practice some features from all three may be 
combined. The first approach involves the choice of items common to many 
different cultures and the validation of the resulting test against local criteria in 
many different cultures. This is the basic approach of the culture-fair tests, 
although their repeated validation in different cultures has often been either 
neglected altogether or inadequately executed. Without such a step, however, we 
cannot be sure that the test is relatively free from culturally restricted elements. 
Moreover, it is unlikely that any single test could be designed that would fully 
meet these requirements across a wide range of cultures. 

Nevertheless, multicultural assessment techniques are needed for basic re- 
search on some fundamental questions. One of these questions pertains to the 
generality of psychological principles and constructs derived within a single cul- 
ture (Anastasi, 1958, chap. 18; Berry et al., 1992; Irvine, 1983; Irvine & Carroll, 
1980). Another question concerns the role of environmental conditions in the 
development of individual differences in behavior—a problem that can be more 
effectively studied within the wide range of environmental variation provided by 
highly dissimilar cultures. Research of this sort calls for instruments that can be 
administered under at least moderately comparable conditions in different cul- 
tures. Safeguards against incorrect interpretations of results obtained with such 
instruments should be sought in appropriate experimental designs and in the in- 
vestigators' thorough familiarity with the cultures or subcultures under investiga- 
tion. What is required is the identification of particular experiential variables 
within any given culture that may be related to socially significant differences in 
behavioral development characterizing that culture (J. W. Berry, 1983; Brislin, 
1993; Segall, 1983; Whiting, 1976). An outstanding example—from the realm of 
personality testing—of how such research efforts may be carried out can be found 
in the Cross-Cultural Anxiety Series of publications. This series addresses itself 
exclusively to the measurement of anxiety across different cultures and has been 
extraordinarily fruitful in expanding the knowledge base about this construct and 
about how anxiety is experienced by people from around the world (see, e.g., 
Spielberger & Diaz-Guerrero, 1990). Ga 

A second major approach is to develop a test within one culture and adminis- 
ter it to persons with different cultural backgrounds. In this case, we should avoid 
the mistake of regarding any test developed within a single cultural framework as 
a universal yardstick for measuring “intelligence” or other constructs. Nor should 
we assume that a low score on such a test has the same causal explanation for 
both a member of the test culture and for a member of another culture. What can 
be ascertained by such an approach is the cultural distance between groups, as 
well as the individual's degree of acculturation and her or his readiness for educa- 
tional and vocational activities that are culture-specific. Some investigators have 
sought to dramatize the fact that the cultural milieu in which an individual is 
reared affects the cognitive skills and knowledge he or she acquires. Early exam- 
ples include a footprint recognition test standardized on aboriginal Australians 
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(Porteus, 1931) and a Draw-a-Horse Test standardized on Pueblo Indian children 
(DuBois, 1939). In both cases, the cultural group on which the test was developed 
excelled in comparisons with other groups. 

Asa third approach, different tests (or substantial adaptations of existing tests) 
may be developed within each culture, validated against local criteria, and used 
only within the appropriate culture. This approach is illustrated by the develop- 
ment of tests for industrial and military personnel within particular cultures. One 
example is provided by the test-development program conducted in several devel- 
oping nations of Africa, Asia, and Latin America under the sponsorship of the 
United States Agency for International Development (Schwarz & Krug, 1972). 
In such instances, the tests are validated against the specific educational and vo- 
cational criteria they are designed to predict, and performance is evaluated in 
terms of local norms. Each test is applied only within the culture in which it was 
developed, and no cross-cultural comparisons are attempted. If the criteria to be 
predicted are technological, however, "Western-type intelligence" is likely to be 
needed, and the tests will reflect the direction in which the particular culture is 
evolving rather than its prevalent cultural characteristics at the time. In addition, 
as a recent survey of global use of tests indicates, the current reality is that, on a 
global basis, the most frequently used tests—at least with children and youths— 
are those constructed in the United States and Europe. In fact, among all the cat- 
egories of nations participating in the survey, the least developed countries, 
which are probably the most different from the U.S. and European countries, are 
also the ones that rely most heavily on foreign testing technology (Hu & Oak- 
land, 1991; Oakland & Hu, 1992)? 

An extensive literature has accumulated on the psychological testing of cul- 
tural minorities within pluralistic societies such as the United States, Israel, and 
the Netherlands (see, eig., Bleichrodt & Drenth, 1991; Duran, 1989; Figueroa, 
1990; Hessel & Hamers, 1993; Samuda, Kong, Cummins, Lewis, & Pascual-Leone, 
1991; Zeidner, 1988). In the present book, this material is treated wherever it can 
be most clearly presented. Thus, in chapter 18, the focus is on ethical and social 
concerns and responsibilities in the use of tests with cultural minorities. The 
technical psychometric problems of test bias and item-group interactions were 
considered in chapters 6 and 7. In the present chapter, the focus was on instru- 
ments developed for cross-cultural ability testing. Problems in the interpretation 
of the results of cross-cultural testing, together with current trends, will be consid- 
ered in chapter 12. 

Today, multicultural testing is moving away from the construction of special 
tests and is focusing more and more on the role of the examiner during the testing 
process. Basically, it is the responsibility of the examiner to: (1) obtain informa- 
tion about the examinee's cultural background; (2). choose a test that is best 
suited for the purpose for which it is being used; (3) present and administer the 


Din recognition of this reality, the International Test Commission has prepared a coherent and carefully crafted 
set of guidelines for adapting educational and psychological tests (Hambleton, 1994, 1996; Van de Vijver & 
Hambleton, 1996). Many of the same issues are addressed by Geisinger (1994), 
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test effectively for the particular individual; and (4) interpret the test results in 
the light of both the individual's experiential background and the context (occu- 
pational, educational, community, etc.) for which the: examinee's qualifications 
are being assessed. These features of the examiner's role will be discussed further 
in chapter 12. 


The Assessment of Environment. Although an examination of traditional 
cross-cultural tests is of historical interest and thereby enhances the understand- 
ing of the origin and nature of current tests, the rapidly increasing contacts 
among world cultures is radically altering the need for such tests. More and more, 
effective tests will be developed, or adapted, within particular cultures for specific 
purposes—for example, educational, industrial, and: counseling uses. The search 
for a universal test of human intelligence is being recognized as futile, because 
of the growing awareness of the significant contribution of the individual's envi- 
ronment and experiential history to the form their intelligence takes. These 
conditions have led to increased activity to assess the environment in which the 
individual functions.?? 

The traditional approach to environmental assessment has relied on some 
global, composite index of socioeconomic level. Sociologists have utilized elabo- 
rate procedures for identifying an individual's social class membership (Warner, 
Meeker, & Eells, 1949). Simpler and more readily applicable indices, however, 
have proved to be equally effective, yielding results that agree very closely with 
those obtained by the more laborious methods. In fact, a reasonably close approx- 
imation of socioeconomic level can be found from the occupation of the principal 
support provider in the family. Several rough scales have been constructed for 
classifying parental occupation into levels; some combine occupational informa- 
tion with parental educational level, as in the widely used Two-Factor Index of 
Social Position. First described by Hollingshead (1957), this index can be found 
in several sources (e.g., Bonjean, Hill, & McLemore, 1967; Hopkins & Stanley, 
1981). More objective procedures for recording occupational information and de- 
riving an occupational-level index have also been developed (Duncan, 1961; 
Stricker, 1985). 

A major limitation of the traditional global indices stems from the fact that 
they classify environments along a single continuum of better-or-worse or higher- 
or-lower. Environments, however, differ in the particular behaviors they rein- 
force, and hence in their effects on specific individual characteristics (see, e.g., 
McAndrew, 1993). Thus the optimal environments for the development of ath- 
letic skills, school achievement, creativity, and social conformity may be quite 
dissimilar. Valuable guidance on an empirical approach to the classification. and 
description of behavior settings can be found in the revision of Roger Barker's pi- 
oneering work on ecological psychology (Schoggen, 1989). : * 

Cross-cultural testing highlights the important role that parenting and zg 
home environment play in the intellectual development of the growing chi 
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This point is discussed further in chapter 12. 
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(see, e.g., M. H. Bornstein, 1991). It is now also recognized that such environmen- 
tal differences are not limited to clearly identifiable cultural or ethnic populations 
but may significantly affect any person's psychological development. Moreover, en- 
vironments need to be more specifically identified in terms of the particular behav- 
ior they foster. Increasing attention has been given to the more precise assessment 
of the psychological effects of different home and family environments. 

Many measures and techniques of various types for the assessment of families 
and the home environment are now available (Bradley & Brisby, 1993; Paget, 
1991). A well-known and widely used home environment inventory is the Home 
Observation for Measurement of the Environment (HOME—B. M. Caldwell & 
Bradley, 1984). This instrument focuses on the types of stimulation and parental 
behavior in the home environment that foster cognitive development (Bradley & 
Caldwell, 1984; B. M. Caldwell & Bradley, 1978; J. H. Stevens & Bakeman, 
1985). The HOME inventory is currently available in three versions, designed for 
use: from birth to 3 years, from 3 to 6 years, and from 6 to 10 years. HOME pro- 
vides scores on several scales assessing such variables as provision of appropriate 
play materials, variety of stimulation, language stimulation, encouragement of so- 
cial maturity, and stimulation of academic behavior (for review, see Boehm, 
1985). Indices of the family socioeconomic status (SES) of infants correlate with 
intellectual performance in early childhood ae well as or better than HOME 
scores. However, the combination of SES and HOME scores can add to the pre- 
dictability of intelligence in certain circumstances (see, èig; D. L. Johnson et al., 
1993). Moreover, the variables assessed by HOME and similar instruments can 
contribute unique and valuable information to the assessment of children for 
many other purposes. 


Group Testing 


hile individual tests such as the Stanford-Binet and the Wechsler scales find 

> their principal application in the clinic, group tests are used primarily in the 
!J educational system, government service, industry, and the military services. 
SET" It will be recalled that mass testing began during World War I with the devel- 

opment of the Army Alpha and the Army Beta for use in the United States Army. 
The former was a verbal test designed for general screening and placement pur- 
poses, The latter was a nonlanguage test for use with persons who could not prop- 
erly be tested with the Alpha owing to foreign-language background or illiteracy. 
The pattern established by these tests was closely followed in the subsequent de- 
velopment of a large number of group tests for civilian application. 

Revisions of the civilian forms of both original Army tests continued in use for 
several decades. In the armed services, the Armed Forces Qualification Test 
(AFQT) was subsequently developed as a preliminary screening instrument, fol- 
lowed by multiple-aptitude classification batteries for assignment to occupational 
specialties, The AFQT provided a single score based on an equal number of vo- 
cabulary, arithmetic, spatial relations, and mechanical ability items. Still later, 
the Armed Services Vocational Aptitude Battery (ASVAB) was developed for 
use in all the armed services as a composite selection and classification battery. 
Some of the subtests of ASVAB serve as the common qualification component. 
For personnel classification, each service selects and combines subtests to fit its 
specific occupational specialty needs. 

In this chapter, we shall consider first the principal differences between group and 
individual tests, This will be followed by an overview of emerging procedures for 
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individually tailored group testing and computer utilization in testing programs. 
Then we shall introduce some current examples of broad-range group tests for gen- 
eral use. Finally, we shall consider a major current trend in test development and use 
that is discernible in group tests, as well as in the individual tests discussed in chap- 
ter 8. This trend is toward a merging of tests originally designed as overall measures 
of a single comprehensive ability (e.g., intelligence, academic aptitude) with multi- 
ple aptitude batteries. More and more ability tests are being adapted to facilitate 
flexibility of use, whereby a single instrument can yield scores of varying breadth, 
from general to specific, to suit the wide diversity of testing purposes and situations. 


GROUP TESTS VERSUS INDIVIDUAL TESTS 


Typical Differences in Test Design. Group tests necessarily differ from 
individual tests in form and arrangement of items. Although open-ended 
questions calling for free responses could be used—and were used in the early 
group tests—today the typical group test employs multiple-choice items. This 
change was obviously required for uniformity and objectivity of scoring. Another 
major difference between traditional individual and group tests is in the control 
of item difficulty. In individually administered tests, the examiner follows entry 
rules, as well as basal and ceiling rules, to ensure that each test taker is examined 
with items that are appropriate for her or his ability level. In group tests, items of 
similar content are arranged in increasing order of difficulty within separately 
timed subtests. This item organization provides each test taker with an opportunity 
to try each type of item, such as vocabulary, arithmetic, and spatial, and to 
complete the easier items of each type before trying the more difficult ones on 
which he or she might otherwise waste a good deal of time. 

A practical difficulty encountered with separate subtests, however, is that 
the less experienced or less careful examiners may make timing errors. Such er- 
rors are more likely to occur and are relatively more serious with several short 
time limits than with a single long time limit for the whole test. To reconcile 
the use of a single time limit with an arrangement permitting all test takers 
to try all types of items at successively increasing difficulty levels, some tests 
utilize the spiral-omnibus format. One of the earliest tests to introduce this for- 
mat was the Otis Self-Administering Tests of Mental Ability, which, as its 
name implies, endeavored to reduce the examiner's role to a minimum. In a 
spiral-omnibus test, the easiest items of each type are presented first, followed 


by the next easiest type, and so on in a rising spiral of difficulty level, as illus- 
trated below: 


1. The opposite of hate is: Answer 
l. enemy, 2. fear, 3. love, 4. [ean [ors Verse FE 
2. If 3 pencils cost 25 cents, how many pencils can be bought for 75 
EE JOIN d nea Lavell ad line at Lo (ihe) 
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3. A bird does not always have: 


1. wings, 2. eyes, 3. feet; "4a nest; 5* abillo nro (07) 
4. The opposite of honor is: 
l. glory, 2. disgrace, 3. cowardice, 4. fear, 5. defeat ...... (PY 


In order to avoid the necessity of repeating instructions in each item and to re- 
duce the number of shifts in instructional set required of the test takers, some tests 
apply the spiral-omnibus arrangement not to single items but to blocks of 5 to 10 
items. 


Advantages of Group Testing. Group tests are designed primarily as 
instruments for mass testing. In comparison with individual tests, they have both 
advantages and disadvantages. On the positive side, group tests can be 
administered simultaneously to as many persons as can be fitted comfortably into 
the available space and reached through a microphone. Large-scale testing 
programs were made possible by the development of group testing techniques. By 
utilizing only printed items and simple responses that can be recorded on a test 
booklet or answer sheet—or on a computer—the need for a one-to-one relation- 
ship between examiner and examinee was eliminated. 

A second way in which group tests facilitated mass testing was by greatly sim- 
plifying the examiner's role. In contrast to the extensive training and experience 
required to administer the Stanford-Binet, for example, most group tests require 
only the ability to read simple instructions to the test takers and to keep accurate 
time. Some preliminary training sessions are desirable, of course, since inexperi- 
enced examiners are likely to deviate inadvertently from the standardized proce- 
dure in ways that may affect test results. Because the examiner' role is minimized, 
however, group testing can provide more uniform conditions than does individual 
testing. The use of tape-recorded instructions, as well as computer administration, 
offers further opportunities for standardizing procedure and eliminating examiner 
variance in large-scale testing. Scoring is typically more objective in group testing 
and can be done by a clerk. Most group tests can now be scored by computers. 

From another angle, group tests characteristically provide better established 
norms than do individual tests. Because of the relative ease and rapidity of gather- 
ing data with group tests, it is customary to test large, representative samples in 
the standardization process. In the most recently standardized group tests, it is not 
unusual for the normative samples to number between 100,000 and 200,000, in 
contrast to the 1,000 to 8,000 cases laboriously accumulated in standardizing the 
most carefully developed individual intelligence scales. 


Disadvantages of (Gretp Festing- Although group tests have several 
desirable features and serve a well-nigh indispensable function in present-day 
testing, their limitations should also be noted. In group testing, the examiner has 
much less opportunity to establish rapport, obtain cooperation, and maintain the 
interest of the examinees. Any temporary condition of the examinee, such as 


illness, fatigue, worry, or anxiety, that may interfere with test performance is less 
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readily detected in group than in individual testing. In general, persons unaccus- 
tomed to testing may be somewhat more handicapped on group than on individ- 
ual testing. There is also some evidence suggesting that emotionally disturbed 
children may perform better on individual than on group tests (Bower, 1969; 
Willis, 1970). 

From another angle, group tests have been attacked because of the restrictions 
imposed on the examinee's responses. Criticisms have been directed particularly 
against multiple-choice items and against such standard item types as analogies, 
similarities, and classification (Hoffman, 1962; LaFave, 1966). Some of the argu- 
ments are ingenious and provocative. One contention is that such items may pe- 
nalize a brilliant and original thinker who sees unusual implications in the 
answers. It should be noted parenthetically that if this happens, it must be a rare 
occurrence, in view of the item analysis and validity data. If it does occur in one 
or two items in an individual's test, moreover, it would hardly have an apprecia- 
ble effect on that examinee's total score. Some critics have focused on the im- 
portance of analyzing errors and inquiring into the reasons why an. individual 
chooses a particular answer, as in the typical Piagetian approach (Sigel, 1963). It 
is undoubtedly true that group tests provide little or no opportunity for direct ob- 
servations of the examinee’s behavior or for identifying the causes of atypical 
performance. For this and other reasons, when important decisions about indi- 
viduals are to be made, it is desirable to supplement group tests either with indi- 
vidual examination of doubtful cases or with additional information from other 
sources. 

Still another limitation of traditional group testing is its lack of flexibility, inso- 
far as every examinee is ordinarily tested on all items. Available testing time 
could be more effectively utilized if each examinec concentrated on items appro- 
priate to her or his ability level. Moreover, such a procedure would avoid boredom 
from working on too-easy items, at one extreme, and mounting frustration and 
anxiety from attempting items beyond the individual's present ability level, at the 
other. Individual tests typically provide for the examiner to choose items on the 
basis of the test taker's own prior responses. This distinction between individual 
and group tests is especially important when the test is designed to cover a wide 
ability range. 


ADAPTIVE TESTING AND COMPUTER-BASED 
ADMINISTRATION 


Individually Tailored Tests. In the effort to combine some of the advan- 
tages of individual and group testing, several techniques are being explored. 
Major interest thus far has centered on ways of adjusting item coverage to 
the response characteristics of individual test takers. In the rapidly growing 
literature on the topic, this approach has been variously designated as adap- 
tive, sequential, branched, tailored, individualized, programmed, dynamic, or 
response-contingent testing. Although it is possible to design. paper-and-pencil 
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group tests that incorporate such adaptive procedures (Cleary, Linn, & Rock, 
1968; Lord 1971a), these techniques lend themselves best to computerized test 
administration. s I WARE AS deg se 
Adaptive testing can follow a wide variety of procedural models (DeWitt & 
Weiss, 1974; Larkin & Weiss, 1974; Weiss, 1974; Weiss & Betz, 1973). A simple 
example involving two-stage testing is illustrated in Figure 10-1. In this hypo- 
thetical test, all examinees take a 10-item routing test, whose items cover a wide 
difficulty range. Depending on her or his performance on the routing test, each 
examinee is directed to one of the three 20-item measurement tests at different 
levels of difficulty. Thus, each person takes only 30 items, although the entire test 
comprises 70 items. A different arrangement is illustrated in the pyramidal test 
shown in Figure 10-2. In this case, all examinees begin with an item of interme- 
diate difficulty. If an individual's response to this item is correct, she or he is 
routed upward to the next more difficult item; if the response is wrong, he or she 
moves downward to the next easier item. This procedure is repeated after each 
item response, until the individual has given 10 responses. The illustration shows 
a 10-stage test, in which each examinee is given 10 items out of the total pool of 
55 items in the test. The heavy line shows the route followed by one examinee 
whose responses are listed as+ (Right) or — (Wrong) along the top. f 


—— 


į TUNE LL € s3 
Figure 10-1. Two-Stage Adaptive Testing with Three Measure! 
Each examinee takes routing test and one measurement test: dran 
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jeep su end aoe a Xp TB d 10 
Stage 


Test Taker —., x E d A + S + + + 
Responses 


Begin 


Figure 10-2. Pyramidal Testing Model. Heavy line shows route of test taker 
whose item responses are listed across the top. 


Computerized Adaptive Testing (CAT). Several variants of both adaptive 
testing models illustrated in Figures 10-1 and 10-2 have been tried with both 
paper-and-pencil and computer forms. However, more complex models, which do 
not use preestablished, fixed patterns of item sequences, are feasible only with 
computerized adaptive testing (Embretson, 1992; B. E Green, 1983; Wainer et al., 
1990). These CAT procedures utilize the item-response theory (IRT) techniques 
described in chapter 7 for assembling the item pool, for testing individuals, and 
for scoring individual performance. For each item in the pool, there is an estimate 
of the ability required for a 50-50 chance of passing that item. This ability 
estimate is the score the individual receives for passing that item. It reflects the 
difficulty level, discriminative value, and probability of guessing correctly that are 
associated with that item. Also available is the item information function, 
indicating the precision of measurement. The test information function, which is 
the sum of the item information functions, serves the same purpose as the tradi- 
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tional standard error of measurement (SEM). Following each item response, the 
computer selects the next item on the basis of the individual’s total response 
history up to that point. Testing continues with the addition of new items until 
the test information function reaches a preestablished standard. Thus all individ- 
uals are measured to the same level of precision. 

The individual's score is based not on the number of items answered: cor- 
rectly, but on the difficulty level and other psychometric characteristics of those 
items. The total test score is derived from the ability estimates corresponding to 
each item passed. This ability estimate is readjusted and refined as each new 
item is added, until the predetermined measurement precision is reached. Such 
scores will be comparable for all persons examined with the item pool, regard- 
less of the particular set of items given to each individual. The actual proce- 
dures for constructing a CAT instrument can be facilitated through the use of 
several available computer programs, such as MicroCAT, distributed by Assess- 
ment Systems Corporation.! 

In general, research by various methods indicates that individualized adaptive 
testing can achieve the same reliability and validity as conventional tests, with 2 
much smaller number of items and less testing time. It also provides greater preci- 
sion of measurement for individuals at the upper and lower extremes of the ability 
tange covered by the test (Lord, 1970, 1971a, 1971b, 1971c; Weiss, 1982). There 
has also been considerable research showing that the correlations between prop- 
erly constructed CAT forms of tests and their paper-and-pencil forms are about as 
high as the reliability coefficients of most tests. Such findings suggest that essen- 
tially the same constructs can be measured by both modalities (Mead & Drasgow, 
1993). At the same time, there-are testing situations for which CAT is not suit- 
able, such as speed tests and screening tests that sort persons in terms of a cutoff 
score (Wainer, 1993b). Special attention has been given to the development of 
technical guidelines for assessing CAT instruments (Green, Bock, Humphreys, 
Linn, & Reckase, 1984). 

Adaptive testing is especially appropriate for use in the individualized instruc- 
tional programs cited in chapter 3, in which students proceed at their own pace 
and may therefore require test items at widely different difficulty levels. Comput- 
erized testing makes it possible to stop testing as soon as the individual's responses 
provide enough information for.a decision about her or his content mastery. 
Other uses of computerized adaptive testing are being actively investigated and 
developed in several settings. One application is illustrated by a computerized 
adaptive test developed jointly by Educational Testing Service and the College 
Board for assigning entering college freshmen to appropriate English and mathe- 
matics courses (Smittle, 1990; Ward, Kline, & Flaugher, 1986). Because of its in- 
dividualized item selection, this test requires little testing time and provides 
immediate scoring. Hence, it can be administered at registration time and permits 
immediate placement of students in courses or sections best suited to their present 
performance level. 


"Address given in Appendix B. See also Quan, Park, Sandahl, and Wolfe (1984) and Weiss and Vale (1987). 
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Another major application of CAT is in large-scale personnel selection and 
classification programs in industry, government, and the military services. CAT is 
especially appropriate for these purposes, for at least three reasons: (1) the steady 
flow of applicants to bejtested, thereby precluding the testing of very large groups 
at any one time and place; (2) the wide-range of ability levels to be covered; and 
(3) the greater test security, since each applicant receives a different set of items 
from a large pool stored in the computer. Explanatory research toward the devel- 
opment of a.CAT version of the Armed Services Vocational Aptitude Battery 
(ASVAB) has been in progress for some time (McBride & Martin, 1983; Moreno, 
Wetzel, McBride, & Weiss, 1984; Wiskoff & Schratz, 1989). Gradually, CAT edi- 
tions are being developed for major existing group tests, such as the Differential 
Aptitude Tests,? described in the last section of this chapter. For many practical 
applications, as well as for substantive research on the nature and sources of indi- 
vidual differences, CAT offers undoubted advantages. A clear and effective pre- 
sentation of its promise for the future of testing can be found in Embretson 
(1992). 


MULTILEVEL BATTERIES 


Overview. Unlike major individual scales and computerized adaptive tests, 
traditional group tests present the same items to all test takers, regardless of their 
individual responses. For this reason, any given group test must cover a relatively 
restricted range of difficulty, suitable for the particular age, grade, or ability level 
for which it is designed. To provide comparable measures of intellectual 
development over a broader range, series of overlapping multilevel batteries have 
been constructed. Thus, any one individual is examined only with the level 
appropriate to her or him, but other levels can be used for retesting the same 
person in subsequent years, or for comparative evaluations of different age groups. 
The fact that successive batteries overlap provides adequate floor and ceiling for 
individuals at the extremes of their own age or grade distributions. It should be 
recognized, of course, that the match between item difficulty and test taker ability 
provided by multilevel batteries is at best approximate. Unlike the individualized 
procedures followed in CAT, moreover, the match is based on prior knowledge 
about the test takers, such as their age or school grade, rather then their own test 
responses. i 

Multilevel batteries are especially suitable for use in the schools, where compa- 
rability of scores over several years is desirable. For this reason the levels are typi- 
cally described in terms of grades. Most multilevel batteries provide a reasonable 
degree of continuity with regard to content or intellectual functions covered. 
Scores are expressed in terms of the same scale of units throughout. The proce- 
dures of item response theory (IRT, described in chap: 7) are being used increas- 


2DAT-Adaptive (1987), corresponding to Paper-and-pencil DAT-Form V (1981). 
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ingly for obtaining score continuity and comparability throughout the range. In 
the process of test standardization, student groups are given overlapping test lev- 
els to provide the necessary linkage data. The normative samples examined at dif- 
ferent levels are also more nearly equivalent than would be true of independently 
standardized tests. Individual levels usually cover from one to three grades. The 
total range that can be uniformly tested with a given multilevel battery, on the 
other hand, may extend from kindergarten to college entrance. 

Most of the batteries yield a total standard score, corresponding to the tradi- 
tional “IQ” in individual tests. Some batteries furnish several types of norms, in- 
cluding percentiles, stanines, or grade equivalents, as well as standard scores. In 
addition to a total, global score, most batteries also yield separate verbal and 
quantitative, or linguistic and nonlinguistic, scores. This breakdown is in line 
with the finding that an individual’s performance in verbal and in other types of 
subtests may be quite dissimilar, especially at the upper levels. 

The names of the batteries are also of interest. Such terms as “intelligence,” 
“general ability,” “mental ability,” “mental maturity,” “academic potential,” and 
“school ability,” are used to designate essentially the same type of test. In the psy- 
chometrician’s vocabulary, these terms are virtually synonymous and inter- 
changeable. It is noteworthy that, in the most recently developed or revised 
batteries, the term “intelligence” has been replaced by more specific designations. 
This change reflects the growing recognition that the term “intelligence” has ac- 
quired too many excess meanings, which may lead to misinterpretations of test 
scores. The multilevel batteries sample major intellectual skills found to be pre- 
requisite for schoolwork. Their primary function is to assess the individual's readi- 
ness for school learning at each stage in the educational process. 

" oct 

Representative Batteries: The nature and scope of current multilevel ability 
batteries can be illustrated with the three batteries listed in Table. 10-1. These 
batteries were chosen because of the recency of their latest revisions, the high 
quality of their test construction procedures, and the size and representativeness 
of their standardization. samples. Still another noteworthy. feature of these 
batteries is that each was standardized concurrently with one or two multilevel 
batteries of educational achievement tests for the same grades, to be discussed in 
chapter 17. By administering both types of instruments to the same standardiza- 
tion samples, it is possible to establish correspondences between the two sets of 
scores. As a result, the two instruments can be used jointly to permit a fuller 
exploration of the student's educational development and of the conditions that 
influence it. 4 

Both reliability and validity of these batteries were investigated extensively by 
appropriate procedures. Kuder Richardson: reliabilities of both total scores and 
scores on the two or three separate content batteries, all computed within single 
grade levels, are mostly in the -90s. Retest correlations are high, indicating eet 
factory stability. Correlations with school grades and with achievement tests ine si 
cate good predictive validity. Intercorrelations of part scores, as well as factoria 
analyses, revealed a large general factor through each total battery. 
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Representative Multilevel Batteries 


$ i N : E 
Battery 3 i 15 ec? Ae Jointly Normed with 


“There is also a separate Primary Test of Cognitive Skills (PTCS), with a different set of tests appropriate to the 
kindergarten and first- levels ^ f Ini 


1f 1 Trier 
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ypical Test Content at Different Levels. The youngest age at which it 
has proved feasible to employ group tests is the kindergarten and first-grade level. 
At the preschool ages, individual testing is required in order to establish and 
maintain rapport, as well as to administer the oral and performance type of items 
suitable for such children. By the age “of 5 or 6, however, it is possible to 
administer printed tests to small groups of no more than 10 or 15 children. In 
such testing, the examiner must still give considerable individual attention to the 
children to make sure that directions are followed, see that pages are turned 
properly in the test booklets, and supervise other procedural details. With one or 
two assistant examiners, somewhat larger groups may be tested if necessary. 

Group tests for the primary level generally cover kindergarten and the first two 

or three grades of elementary school. In such tests, each child is provided with a 
booklet on which are printed the pictures and diagrams constituting the test 
items. All instructions are given orally and are usually accompanied by demon- 
strations. Practice exercises are frequently included in which children try one or 
two sample items and the examiner or proctor checks the responses to make cer- 
tain that the instructions were properly understood. The children mark their re- 
sponses on the test booklet with a crayon or soft pencil. Most of the tests require 
only marking the correct picture out of a set. A few call for simple motor coordi- 
nation, as in drawing lines that join two dots. Obviously, tests for the primary 
level can require no reading or writing on the part of the test taker. 
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Most multilevel ability batteries include tests suitable for the primary level. The 
types of test items used at this level are illustrated in Figure 10-3, which shows 
some sample items from: Level A of the OLSAT, suitable for kindergarten. In 
recognition of the rapid intellectual growth occurring at the early ages, the latest 
edition of OLSAT provides four separate levels (A, B, C, D) for kindergarten and 
Grades 1, 2, and 3. This represents more differentiation than is available in earlier 
editions of this battery, as well as in other multilevel batteries. At Level A, all in- 
structions are given orally by the examiner. The child responds by filling in the 
circle under the chosen answer picture, as shown in Figure 10-3. 

Four of the ten types of items included in Level A are illustrated. The total test 
requires about 75 minutes. It is administered in two sessions, each of which includes 
a 5-minute break after the first 15 or 20 minutes. There is also a Practice Test, with 
similar item types and instructions, which may be given on an earlier day. The sam- 
ple items shown in Figure 10-3 are relatively simple items employed to familiarize 
the children with the items they will encounter in the test itself. The explanations 
given in Figure 10-3 area highly condensed version. of the detailed oral instructions 
and clear description of item content that accompany each item. There are also 
some minor format differences to facilitate understanding and hold the attention of 
young children. For instance, pages and rows are identified not only by numbers but 
also by tiny pictures of familiar objects, such as cup, shoe, or scissors; and each child 
is given a marker to keep track of the proper row to work on. TOTUM. | 

Tests for the elementary school level, extending from the third or fourth grade 
up, have much in common in both content and general design. Since functional 
literacy is presupposed at these levels, the tests are predominantly verbal in con- 
tent; most also include arithmetic problems or other numerical tests. In addition, 
a few batteries provide nonreading tests designed to assess the same abstract rea- 
soning abilities in children with foreign-language background, reading disabili- 


Item types appropriate for the elementary school level are illustrated in Figure 
10-4. These items are typical of those employed at intermediate levels of the Cogni- 


levels for Grades K through 3 and eight levels covering Gra les 3 through 12. Each 
level is printed in a separate booklet. Examinees taking different levels start and stop 


most effective. e j ` 
Levels A through H contain the same nine subtests, grouped into three batter- 


ies as follows: 


Verbal Battery —Vetbal Classification, Sentence Completion, Verbal Analo- 
gies. Bi 
Quantitative Battery—Quantitative Relations, Number Series, Equation Building. 
Nonverbal Battery—Figure Classification, Figure Analogies, Figure Analysis. 


These subtests use no words or numbers, but only geometric or figural ele- 


ments; the items bear relatively little relation to formal school instruction. 
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1 
Picture Classification: Mark the picture that does not belong with the others. 


P 


[7] 


[9) O © 


Figural Analogies: Mark under the figure that belongs in the empty box. 


A Sec 


Following Directions: Mark the number that is right below a heart. 


Picture Series: Mark under the picture that shows what belongs inthe empty box. 
a |e let log] | oy 
we Ne ee ex ss ` 
O Oo © (©) 


Figure 10-3. Some Sample Items Used in the Otis-Lennon School Ability Test 
(OLSAT, 7th ed., Level A). 


LE 


(Copyright © 1996 by The Psychological Corporation. All tights reserved. Reproduced by permission:) 
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Each subtest is preceded by practice exercises with detailed explanations. Practice 
Tests that can be given in advance of the actual administration are also available. 
Figure 10-4 shows a typical item from six of the nine subtests, with condensed 
and slightly reworded instructions. In difficulty level, these items correspond 
roughly to items given in Grades 4 to 6. The manual recommends that all three 
batteries be given to each child, in three testing sessions. For most children, the 
Nonverbal Battery does not predict school achievement as well as do the Verbal 
and Quantitative batteries. However, a comparison of performance on the three 
batteries may provide useful information regarding special abilities or disabilities. 

The upper levels of multilevel batteries, appropriate for high school students, do 
not differ basically from those designed for the elementary school grades except in 
degree of difficulty. These levels are also suitable for testing general, unselected 
adult groups for various purposes- The content of tests at this level can be illus- 
trated by items fromthe highest level of the Test of Cognitive Skills (TCS/2). 


Each level of this battery contains four tests: 


Sequences—understanding and applying a tule or principle in a pattern or se- 
quence of figures, letters, or numbers. 

Analogies—identifying a relationship within a pair of pictures and completing a 
second pair exhibiting the same relationship; includes scenes, people, ani- 
mals, objects, or graphic symbols. 

Verbal Reasoning—tested by a diversity of item types, including identification of 
essential elements of objects or concepts, classifying objects according to 
common attributes, inferring relationships between sets of words, or drawing 
logical conclusions from short passages- ` 

Memory—definitions of a set of artificial words (nonsense syllables) are pre- 
sented for study and their recall is tested after about 25 minutes, during 
which other tests are administered. 


There is also a Practice Test to be administered one or two days before the actual 
testing session. Illustrations of three out of the four item types are given in Figure 
10-5. In this battery the same item types from the sequences, analogies, and verbal 
reasoning tests are used from the 4th grade up through the 12th grade; and the 
same sample items are included at all these levels. The upper two levels, corre- 
sponding to the high school grades, are differentiated on the basis of the empiri- 
cally established higher diffculty of their items. ter 

The inclusion ©: a memory test is a distinctive feature of the TCS/2 battery. 
Separate scores are provided for verbal, nonverbal, and memory tests. These apti- 
tude areas were identified through preliminary factor analyses, which guided the 


development and selection of test items. Scaling across levels was achieved during 


the standardization process, through the techniques uice cin theory Ge 
see chap: 7). For this purpose, student samples too! inkage tests containing ` ems 
from two adjacent levels (TCS/2 Technical Report, 1993, pp. 113-114). The TCS/2 


battery made unusually good use of IRT procedures in both test development and 


Verbal Classification: Think how the words in dark type are alike. 
Then find the word in the line below that goes with them. 


kind ` friendly ` helpful 


A capable B active C generous D pretty E strong 


Verbal Analogies: Figure out how the first two words go together. 
Then find the word that completes the second pair in the same way. 


ship —* harbor : truck —» 


A drive B highway . . C garage D gasoline E freight 


Number Series: Figure out the rule used to arrange the number 
series below. Then choose the number that should come next. 


OS TES, Ua ak 2 d =J 


AO Bi c2 D3 EA 


Equation Building: The numbers and signs at the top can be put 
together so as to give one of the answers at the bottom. Mark that 
answer. 


40 K2 L4 M6 N 10 


Figure Classification: The first three figures are alike in some way. 
Find the figure at the right that is like the first three. 


" L M N 
CH ES cH IE 
4 S AN 


Figure Analogies: Decide how the first two figures are related to each 
other. Then find the one figure at the right that completes the second 
pair in the same way. ` 


J K L M N 
O-DD- DKS XS 
Figure 10-4. Some Item Types from Cognitive Abilities Test. Responses 
marked on separate answer sheet: Correct answers: C, C; D, K, J, K. 


(From CogAT, Form 5, Practice Test for Levels A-H. Copyright © 1993 by The Riverside Publishing 
Company. Reproduced by permission.) 


K 
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Sequences. 
Figure out the pattern in each series and then choose the item in the row below that 
belongs in the blank space 


Analogies. 
Decide how the two pictures in the top boxes are related to each other. Then locate the 
picture at the right that completes the second pair in the same way. 


Verbal Reasoning. 

1. Look at the underlined word, 
alphabet. Which of the words 
below names something that 


2. Given the information in the two 
top sentences, which of the 
sentences below must. be true? 


must always be a part of it? 

Big Ben is a clock in England. 
alphabet Judy has visited Big Ben. 
A words A People often visit Big Ben. 
B letters B Many clocks in England are big. 
C numbers C BigBenis named after a person. 
D sentences D Judy has been to England. 


Figure 10-5. Some of the Item Types Used in the Test of Cognitive Skills. 
Responses are marked on a separate answer sheet. 

(From TCS/2 Practice Test, Levels 2-6. Copyright ©1992 by CTB/McGraw-Hill School Publishing Company. 
Reproduced by permission.) 1 
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scoring. Thus it is possible to obtain scores that reflect not simply the number of 
items passed but also the difficulty level of the individual items passed. 


Recognition of Multiple Aptitudes. As suggested in the opening para- 
graphs of this chapter, there is a clearly discernible trend toward bridging the ini- 
tial gap between the testing of overall, general ability and the measurement of 
separate, relatively independent aptitudes. The-rapprochement has come from 
both sides of what began as controversial and temporarily irreconcilable ap- 
proaches to ability. testing. This trend closely parallels what was happening with 
individual tests, as discussed in chapter 8. In the present case, the first multilevel 
batteries were designed as a group version of the individual intelligence tests, al- 
though they usually had a more narrowly defined goal, namely to assess scholastic 
aptitude, or readiness for successive educational levels. Gradually, it became ap- 
parent that a single overall score could profitably be supplemented with some- 
what narrower scores. This development occurred in all three of the batteries 
considered in this section. 

In the OLSAT (1996 ed.); it was pointed out that the total score is itself lim- 
ited to a group of "verbal-educational" aptitudes; the battery makes no attempt to 
assess the "practical-mechanical" segment of general intelligence. Furthermore, 
there is provision for identifying narrower, supplementary scores in verbal and 
nonverbal part scores. This distinction, however, focuses primarily on those test 
items that do and those that do not require the use of language in the test re- 
sponses. The differentiation is thus directed primarily to the use of tests with stu- 
dents having limited knowledge of English. It is also possible, however, to obtain 
stanine scores (within age or grade) for five test clusters, within the broad verbal 
and nonverbal categories. These clusters include verbal comprehension, verbal 
teasoning, pictorial reasoning, figural reasoning, and quantitative reasoning. The 
manual observes that a comparison of an individual’s relative performance in 
these clusters may help to identify students’ strengths and. weaknesses (OLSAT, 
7th ed., Technical Manual, 1997). 

The CogAT (Form 5, 1993) provides norms not only for its total composite 
score but also for verbal, quantitative, and “nonverbal” (i.e., spatial) reasoning 
scores. Moreover, individual score reports include bar graphs for these three area 
scores, as well as for total score. The manual specifically calls attention 
to the value of the bar graph profile for predicting academic achievement. The 
instructions for score interpretation make repeated reference to the practical 
importance of considering the individual’s score profile (see e.g., Riverside 2000, 
Ga P 44). Thus the recognition of multiple aptitudes moved. another step 

forward. 

. The Test of Cognitive Skills (TCS/2, 1992) is even more explicit in recogniz- 
ing the value of multiple aptitude data in the assessment of academic perfor- 
mance. Norms are available not only for total scores on the battery but also for 
each of the four subtests and for a “nonverbal” total (Sequences and Analogies 
subtests). Moreover, the battery itself was developed to assess three broad cogni- 
tive traits that had been identified through factor-analytic research, namely, Ver- 
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bal Reasoning, Nonverbal Reasoning, and Memory. There is thus specific recog- 
nition of the need for multiple aptitude testing, to be discussed in the following 
section. 


MEASURING MULTIPLE APTITUDES 


Traditional intelligence tests, whether administered individually or ‘to groups, 
were designed primarily to yield a single, global measure of the individual's gen- 
eral level of cognitive development, such as an IQ. Both practical and theoretical 
developments, however, soon drew attention to certain differentiable aptitudes 
within the loose conglomerate represented by early intelligence tests. These de- 
velopments led to the construction of separate tests for the measurement of a few 
widely applicable aptitudes. At the same time, they led to a refined definition and 
a fuller understanding of what the intelligence tests themselves measured. 

A number of events contributed to the growing interest in the measurement of 
different aptitudes: First, there was an increasing recognition of intraindividual 
variation in performance on intelligence tests. Crude attempts to compare the in- 
dividual's relative standing on different subtests or item groups antedated the de- 
velopment of multiple aptitude batteries by many years. Intelligence tests, 
however, were not designed for this purpose. The subtests or item groups Were 
often too unreliable to justify intraindividual comparisons. In the construction of 
intelligence tests, moreover, items or subtests were generally chosen to provide a 
unitary and internally consistent measure. In such a selection, an effort was there- 
fore made to minimize, rather than maximize, intraindividual variation. Subtests 
or items that correlated very low with the rest of the scale would, in general, be 
excluded. Yet these are the very parts that would probably have been retained if 
the emphasis had been on the differentiation ‘of abilities. Because of the way in 
which most intelligence tests were constructed, it is unlikely that performance on 
these tests can be significantly differentiated into more than two categories, such 
as verbal and nonverbal, or linguistic and quantitative. 

The development of multiple aptitude batteries was further stimulated by the 
gradual realization that so-called general intelligence tests are in fact less general 
than was originally supposed. It soon became apparent that many such tests were 
primarily measures of verbal comprehension. Certain areas, such as that of me- 
chanical abilities, were usually untouched, except in some of the performance and 


nonlanguage scales. As these limitations of intelligence tests became evident, 


: : intelli » Distinctions between 
psychologists began to qualify the term “intelligence.” Distinctions 

“academic” and “practical” intelligence were suggested by some. Others va of 
“abstract,” “mechanical,” and “social” intelligence. Tests of “special aptitudes 


were likewise constructed, to supplement the intelligence tests. But closer analy- 


sis showed that the intelligence tests themselves could be said to measure a cer- 
tain combination of special aptitudes, such as verbal and numerical abilities. 

A strong impetus to differential aptitude testing was also provided by the grow- 
ing activities of psychologists in career counseling, as well as in the selection an 
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classification of industrial and military personnel. The early development of spe- 
cialized tests in clerical, mechanical, and other vocational areas is a reflection of 
such interests. The assembling of test batteries for the selection of applicants for 
admission to schools of medicine, law, engineering, dentistry, and other profes- 
sional fields represents a similar development which had been in progress for 
many years. Moreover, a number of differential aptitude batteries, such as those 
prepared by the armed services and by the U.S. Employment Service, were the di- 
rect result of ional selection or classification work. 

Finally, dek ag trait organization through the techniques of factor analysis? 


Differential Aptitude Tests. One of the most widely used multiple aptitude 
batteries is the Differential Aptitude Tests (DAT). First published in 1947, the 
DAT has been periodically revised (5th ed., Form C, 1992). The battery was 
Lë prio for use in educational and career counseling of students in 
grades 8 to 12. The 5th-edition DAT is available in two levels: Level 1, designed 
primarily for students in Grades 7 to 9 and adults who have completed these years 
of schooling; and Level 2, for students in Grades 10 to 12 and adults with more 
than 9 years of schooling but who may not have graduated from high school. 
The DAT comprises the eight tests: Verbal Reasoning, Numerical 
Reasoning, Abstract Reasoning, Perceptual Speed and Accuracy, Mechanical 
Reasoning, Space Relations, Spelling, and Language Usage. Sample items from 
four of these tests are shown in Figure 10-6. A Practice Test covering all eight 
areas is available for advance use with examinees. There is also a special form, the 
Aptitude Tests for Personnel and Career Assessment, in which each 
of the eight tests has been shortened and printed in a separate booklet. With this 
on particular tests may be chosen for relevance to specific occupations and 
Like most major current tests, the DAT can be completely administered in a 
computerized version. A more recent development that is now being explored is a 
computerized adaptive testing (CAT) form (DAT Adaptive), available since 1987. 
As in all adaptive testing, the examinee takes only those items that are appropriate 
for her or his performance level. For this CAT version, the items used were from an 
earlier DAT (Form V) and were analyzed by the Rasch model, a simplified, one- 
Parameter model of item response theory (see chap. 7). 
An extensive collection of DAT m ty data has been assembled over the 
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the battery in different occupational settings or included it in research projects. 
Most of these data are concerned with predictive validity in terms of high school 
achievement in both academic and vocational programs. Many of the coefficients 
are high, even with intervals as long as three years between test and criterion 
data. The results are somewhat less encouraging with regard to differential predic- 
tion. Although, in general, verbal tests correlate more highly with English courses 
and numerical tests with mathematics courses, there is evidence of a large general 
factor underlying performance in all academic work. Verbal Reasoning, for exam- 
ple, gives high correlations with most courses. It is chiefly for this reason that the 
VR + NR score was introduced as an index of scholastic aptitude. Being the sum 
of the scores on the Verbal Reasoning and Numerical Reasoning subtests, this 
index correlates in the .70s and .80s with composite criteria of academic achieve- 
ment. Norms are provided for this index, which is one of the scores regularly in- 
cluded in the DAT profile (see Fig. 4-6). There is also a Partial Battery, containing 
only the VR and NR subtests, which can be used only if a general index of 
scholastic aptitude is wanted. 

From another angle, there is growing evidence that traditional tests of “general 
intelligence” or “scholastic aptitude"— whether designed for individual or group 
administration—yield substantial validity coefficients against a wide variety of 
educational and occupational criteria (L. S. Gottfredson, 1986a; Guion & Gib- 
son, 1988; Pearlman, et al., 1980; Schmidt, Hunter, Pearlman, & Shane, 1979). 
Such tests include essentially the same cluster of cognitive skills and knowledge 
assessed by the VR + NR score from the DAT. Thus it can be seen that the early 
gap between intelligence and multiple aptitude tests is being bridged from both 
ends. Tests like the DAT have been placing more emphasis on the use and inter- 
pretation of broad scores, such as VR + NR. At the same time, the initially gen- 
eral tests were putting more emphasis on the use and interpretation of separate 
subtest scores and profile analysis. For reviews of the DAT, see Hattrup (1995) 
and N. Schmitt (1995). 


Multidimensional Aptitude Battery. An even closer approximation to the 
emerging pattern for ability testing is exemplified in the more recently developed 
Multidimensional Aptitude Battery (MAB). First published in 1984, its adminis- 
trative procedures, norms, and manual have subsequently been updated (Jackson, 
1994b). The MAB is a group test designed to assess the same aptitudes as the 
Wechsler Adult Intelligence Scale—Revised (WAIS-R; see chap. 8). It includes 
five subtests in the Verbal Scale? and five in the Performance Scale, and it yields 
V, P, and Full Scale deviation IQs. The MAB is suitable for adolescents and adults 
but is not recommended for use with mentally retarded or mentally disturbed 


persons, whose condition could interfere with their understanding or following 
test instructions. 


— MM 
"The WAIS-R Digit Span subtest has no parallel in the MAB. This subi i i E 
and-pencil format; it also tended to correlate lowest with Full Scale Wechierscores: pm 
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The ten subtests, which have the same names as the corresponding WAIS-R 
subtests (with one exception), ate listed below: 1 


VERBAL PERFORMANCE 
Information Digit Symbol 
Comprehension Picture Completion 
Arithmetic Spatial 

Similarities Picture Arrangement 
Vocabulary Object Assembly 


The Block Design subtest of the WAIS-R has been replaced by the Spatial subtest 
in the MAB. Considerable ingenuity was displayed in designing paper-and-pencil 
items to measure. the same functions covered by the individually administered 
WAIS-R; this was especially true of the Performance subtests. Figure 10-7 shows 
a simple, demonstration item for the Picture Completion subtest and one for the 
Spatial subtest. The respondents task in all items within each of these subtests is 
the same as that illustrated in these items. In the Picture Completion subtest, the 


the five figures at the right could result from simply turning the figure at the left 
into a different position on the page; the other given options require the figure to 
be flipped over. 

The five Verbal subtests are presented in one booklet and the five nonverbal 
“Performance” subtests inca second booklet. Each booklet begins with practice 
problems illustrating the types of items used in three of the five subtests} each 
subtest begins with one, two, Or three further demonstration items. General 
instructions. and specific. instructions for each subtest are given in the manual; 
they can be presented orally, by an audio cassette, or by computer. Responses are 
recorded on separate answer sheets or on a computer. Scoring can be done manu- 
ally from templates or by computer. ‘As one of the available options, one of the 
current versions of MAB permits full computerized administration and scoring by 
the local test user. 

By reference to a normative table, raw scores on each of the 10 subtests can be 
transformed to uniform scaled score equivalents (standard scores wi M=50 
and SD = 10). The sums of these scaled scores on V, P, and Full Scale are used to 
look up deviation IQs (M = 100, SD = 15) within each of nine age groups, ex 
tending from 16 to 74 years. There are also separate tables for finding scaled 
score equivalents within each of the same nine age groups; these scaled scores 
can be used in plotting age-appropriate profiles. An advantage of the use of 
scores from a single battery rather than from a set of different tests Is that the 
battery tests were all normed on the same standardization sample and are hence 
directly comparable. 

On Te whale, the psychometric procedures followed in developing the MAB 
were of high technical quality; and every step was supported by intensive research 
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Picture Completion - Choose the letter that begins the word describing the missing part of 
the picture. 


The answer is Light, so A should be marked. 


Spatial - Choose one figure to the right of the vertical line which is the same as the figure on 
the left. One figure can be turned to look like the figure on the left; the others would have to 
be flipped over. 


K| Est 


The correct answer is A, so A should be marked. The others BCDE would have to be flipped 
over. 


Figure 10-7. Demonstration Items from Two Performance Tests of the 
Multidimensional Aptitude Battery (MAB). 


(Copyright © 1983 by Douglas N. Jackson. Reproduced by permission.) 


extending over more than a decade. A noteworthy feature of the MAB is its em- 
pirical linkage to the WAIS-R. It was first shown that each MAB subtest, as well 
as total V, P, and Full Scale scores, correlated very highly with WAIS-R corre- 
sponding scores in a heterogeneous sample of 145 adolescents and adults. In fact, 
these correlations were as high as the correlations between WAIS and WAIS-R 
scores, or higher, with two exceptions (Digit Symbol and Spatial/Block Design). 
It should be noted that these correlations were obtained despite the fact that all 
MAB items are new; there are no common items between the two scales. Linear 
equating of MAB scores to WAIS-R scores was then carried out on a heteroge- 
neous equating sample of 160 respondents, aged 16 to 35 years, who had taken 


See reviews by S. B. Reynolds (1989) and Silverstein (1989). 
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both tests. Preliminary equating on student samples and psychiatric patient sam- 
ples suggested that such calibration was generalizable over different samples of 
test takers. These calibrating procedures offer a promising way of expressing the 
scores on newly developed tests in terms of a uniform scale based on available 
data from a large, representative standardization sample. They represent a step to- 
ward the goal of national anchor norms discussed in chapter 3. 

From another angle, this battery exemplifies clearly the current trend toward 
hierarchical scores. It provides fully interpretable scores at the 10-subtest level, at 
the broader Verbal and Performance level, and with the overall total score for the 
complete battery. The test user thus has enhanced flexibility in the choice of ap- 
propriate score level for her or his particular testing purpose—a condition that 
seems likely to characterize ability testing in the twenty-first century. 


Nature of Intelligence 


ll psychological tests are designed to measure behavior. Hence the selection 
of proper tests and the interpretation of test results require knowledge 
about human behavior. Familiarity with relevant behavioral research is 


cal research contribute to the understanding of: (1) the behavior measured by 
tests of cognitive abilities or "intelligence," (2) the sources of individual differ- 
ences in such behavior and (3) the predictability of such behavior at a subsequent 


Sternberg, 1990), and by psychologists who specialize in different areas or identify 
with different theoretical orientations (e.g., H. Gardner, 1983, 1993; Sternberg, 
1985a, 1989 —see also Brody, 1992; Lubinski & Benbow, 1995; Messick, 1992; 
H. Rowe, 1991). An early display of this diversity of meanings was provided in 


gence. .. ," 1921). A similar Survey was undertaken 65 years later (Sternberg & 
Detterman, 1986). An examination of these publications should be of consider- 
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our goal is more narrowly limited: What should we know about that particular 
segment of human intelligence assessed by traditional intelligence tests and sym- 
bolized by the IQ? It is in this more circumscribed sense that the term "intelli- 
gence” is used in the present discussion (see Anastasi, 1983c). 


MEANING OF AN IQ 


For the general public; IQ is not identified with a particular type of score on a par- 
ticular test but is often a shorthand designation for intelligence.! So prevalent 
has this usage become that it cannot be merely ignored or deplored as a popular 
misconception. To be sure, when considering the numerical value of a given IQ, 
we should always specify the test from which it was derived. Different intelligence 
tests that yield an IQ do in fact differ in content and in other ways that affect the 
interpretation of their scores. Some of these differences among tests sharing the 
common label of "intelligence-test" were apparent in the examples considered in 
the preceding chapters. Nonetheless, there sa need to reexamine the general 
connotations of the construct “intelligence,” as symbolized by the JC, 

First, tested intelligence should be regarded as a descriptive rather than an ex- 
planatory concept. An IQ is an expression of an individual's ability level ar a 
given point in time, in relation to the available age norms. No intelligence test 
can indicate the reasons for one’s performance. To attribute inadequate. perfor- 
mance on a test or in everyday life activities to “inadequate intelligence" is a tau- 
tology and in no way advances our understanding of the individual’s handicap. In 
fact, it may serve to delay efforts to explore the actual causes of the handicap in 
the individual’s personal history. j i 

Intelligence tests, as well as any other kind of tests, should be used not to label 
individuals but to help in understanding them. This point has been emphasized 
for many years in many sources, ranging from the writings of individual psycholo- 
gists to the formal reports of national committees (Hobbs, 1975a, 1975b; Na- 
tional Commission. . ., 1990). A widely publicized book (Herrnstein & Murray, 
1994), entitled The Bell Curve, has served to perpetuate various stereotypes and 
misconceptions about ethnic and gender differences in intelligence test perfor- 
mance and has added confusion and controversy to an already complex problem. 
An objective and factual treatment of the relevant points is provided in a report 
by the American Psychological Association Task Force on Intelligence (Neisser 
et al., 1996). A symposium at the 1995 convention of the American Psychologi- 
cal Association was also devoted to clarifying relevant issues (Steele, Chair, Au- 
gust 1995). To bring persons to their maximum functioning level we need to start 
where they are at the time; we need to assess their strengths and weaknesses and 
plan accordingly. If a reading test indicates that a child is retarded in reading, we 
do not label the child as a nonreader and stop; nor do we administer a nonverbal 


‘When first introduced, the IQ (Intelligence Quotient) did refer to a type of score—i.e., the ratio of mental age 
to chronological age (see chap. 3). = ! : 
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test to conceal the handicap. Instead, we concentrate on teaching the child to 
read. An important goal of contemporary testing, moreover, is to contribute to 
self-understanding and personal development. The information provided by tests 
is being used increasingly to assist individuals in educational and career planning 
and in making decisions about their own lives. The attention being given to ef- 
fective ways of. communicating test results to the individual attests to the growing 
recognition of this application of testing. 

A second major point to bear in mind is that intelligence is not a single, uni- 
tary ability, but a composite of several functions. The term is commonly used to 
cover that combination of abilities required for survival and advancement within 
a particular culture (Anastasi, 1986c). It follows that the specific abilities in- 
cluded in this composite, as well as their relative weights, vary with time and 
place. In different cultures and at different historical periods within the same cul- 
ture, the qualifications for successful achievement differ. The changing composi- 
tion of intelligence can also be recognized within the life of the individual, from 
infancy to adulthood. One's relative ability tends to increase with age in those 


with numerical and other abstract symbols. These are the abilities that predomi- 
nate in school learning. Most intelligence tests can therefore be regarded as mea- 
sures of scholastic aptitude or academic intelligence. The IQ is both a reflection 


academic intelligence is also an effective predictor of performance in many occu- 
pations and other activities of daily life in such cultures. 


Clearly a misuse of tests. Decisions must be made by persons. Tests represent one 
Source of data utilized in making decisions; they are not themselves decision- 
making instruments. 
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tests (Humphreys, 1979; Resnick, 1976; Sternberg & Detterman, 1979). This in- 
terest has continued with undiminished vigor, spanning. different fields of psy- 
chology and pursued through diverse methodological approaches and from varied 
theoretical orientations. It is reflected in continuing series of volumes (Detter- 
man, 1985-1993; Sternberg, 1982-1989), as well as in a complete encyclopedia 
devoted to the topic (Encyclopedia of Human Intelligence, 1994). 

Efforts to understand what intelligence tests measure have included not only 
the standard statistical procedures of construct validation, such as factor analysis, 
but also the application of information-processing techniques to the tasks pre- 
sented in intelligence tests (see chap. 5). The latter approach focuses on the ele- 
mentary processes whereby a test taker arrives at the answer to a test question, 
rather than considering only the correctness of the answer. This type of analysis 
should contribute substantially to the diagnostic use of tests and to the develop- 
ment of training programs to meet specific individual needs. 


HERITABILITY AND MODIFIABILITY? 


Much confusion and controversy have resulted from the application of heritabil- 
ity estimates to intelligence test scores. A well-known example is an article by 
Jensen (1969) that engendered great furor and led to many heated arguments, 
some of which are continuing in the 1990s. Although there are several aspects to 
this controversy and the issues are complicated, a major substantive source of 
controversy pertains to the interpretation of heritability estimates. Specifically, a 
heritability index shows the proportional contribution of genetic or hereditary 
factors to the total variance of a particular trait in a given population under exist- 
ing conditions. For example, the statement that the heritability of Stanford-Binet 
IQ among urban American high school students is .70 would mean that 70% of 
the variance found in these scores is attributable to hereditary differences and 
30% is attributable to environment. 

Heritability indexes have been computed by various formulas (see, e.g., Jensen, 
1969; Loehlin, Lindzey, & Spuhler, 1975), but their basic data are measures of fa- 
milial resemblance in the trait under consideration. A frequent procedure is to 
utilize intelligence test correlations of monozygotic (identical) and dizygotic (fra- 
ternal) twins. Correlations between, monozygotic twins reared together and be- 
tween monozygotic twins reared apart in foster homes have also been used. 

Several points should be noted in interpreting heritability estimates. First, the 
empirical data in familial resemblances ate subject to some distortion because of 
the unassessed contributions of environmental factors. For instance; there is evi- 


dence that monozygotic twins share a more closely similar environment than do 


4 ion di in thi i ofan extensive area of research on the operation of 
‘The question discussed in this section relates to a small part OUR UE dis federal toate which in- 
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dizygotic twins (Anastasi, 1958, pp. 287-288; Koch, 1966). And the environ- 
ments of siblings reared together may be psychologically quite dissimilar (Daniels 
& Plomin, 1985). Another difficulty is that twin pairs reared apart are not as- 
signed at random to different foster homes, as they would be in an ideal experi- 
ment; it is well known that foster home placements are selective with regard to 
characteristics of the child and the foster family. Hence, the foster home environ- 
ments of the twins within each pair are likely to show sufficient resemblance to 
account for some of the correlation between their test scores. There is also evi- 
dence that twin data regarding heritability may not be generalizable to the popu- 
lation at large because of the greater susceptibility of twins to prenatal trauma 
leading to severe mental retardation. The inclusion of such severely retarded 
Cases in a sample may greatly increase the twin correlation in intelligence test 
scores (Nichols & Broman, 1974). 

Apart from questionable data, heritability indexes have other intrinsic limita- 
tions (see Anastasi, 1971; Hebb, 1970). It is noteworthy that in the early part of 
the previously cited article, Jensen (1969, pp. 33-46) clearly listed these limita- 
tions among others. First, the concept of heritability is applicable to populations, 
not individuals. For example, in trying to establish the etiology of a particular 
child’s mental retardation, the heritability index would be of no help. Regardless 
of the size of the heritability index in the population, the child’s mental retarda- 
tion could have resulted from a defective gene, asin phenylketonuria (PKU), 
from prenatal brain damage, or from extreme experiential deprivation. 

Second, heritability indexes refer to the population on which they were found 
at the time: Any change in either hereditary or environmental conditions would 
alter the heritability index. For instance, an increase in inbreeding; as on an iso- 
lated island, would reduce the variance attributable to heredity and hence lower 
the heritability index; increasing environmental homogeneity, on the other 
hand, would reduce the variance attributable to environment and hence raise the 
heritability index. Furthermore, a heritability index computed within one popula- 
tion is not applicable to an analysis of the differences in test performance between 
two populations, such as different ethnic groups. 

Third, heritability does not indicate the degree of modifiability of a trait. Even 
if the heritability index of a trait in a given population is 100%, it does not follow 


Regardless of the magnitude of heritability indexes found for IQs in various 
populations, one empirical fact is well established: The JQ is not fixed and un- 
changing; and it is amenable to modification by environmental interventions. 
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Some evidence for this conclusion is examined in the next chapter, in connec- 
tion with longitudinal studies. There has been some progress in identifying char- 
acteristics of accelerating and decelerating environments. Rises and drops in IQ 
may also result from both fortuitous environmental changes occurring in a child's 
life and planned environmental interventions. Major changes in family structure, 
sharp rises or drops in family income level, or adoption into a foster home may 
produce conspicuous increases or decreases in IQ. j 

Worldwide interest in systematic programs for the development of intelligence 
spread rapidly in the late 1970s and the 1980s, and continues to the present. This 
is illustrated by the publication the Human Intelligence International Newsletter 
from 1980 to 1987. With an international editorial board, this newsletter pro- 
vided a quarterly report on cognitive research and its educational applications. 
Another major example is the-10-year program established and systematically im- 
plemented in Venezuela under government sponsorship. Covering many specific 
projects for the development of “thinking skills" from infancy to old age, this pro- 
gram has stimulated the initiation of similar projects in several other countries 
(Collins and Mangieri, 1992; Greenwald, 1982, 1984; Herrnstein, Nickerson; 
Sanchez, & Swets, 1986; Nickerson, 1988; Spitz, 1986; Sternberg, 1986). 

Research on the effects of planned interventions at the infant and preschool 
levels will be examined in chapter 12. Also noteworthy, however, is the growing 
body of evidence demonstrating the effectiveness of such. interventions at later 
life stages. Although on a smaller scale than those directed at the preschool level, 
programs for school-age children have yielded encouraging results (Bloom, 1976; 
Brown & Campione, 1986; Gampione & Brown, 1987; Jacobs & Vandeventer, 
1971; Olton & Crutchfield, 1969; Resnick & Glaser, 1976). Some investigators 
have focused on still older age levels, working with college and professional- 
school students; and. they, too, report significant improvement in both academic 
achievement and scholastic aptitude test performance (Bloom & Broder, 1950; 
Whimbey, 1975, 1977, 1980). Research on elderly persons has also yielded evi- 
dence of learning and transfer effects following training interventions (Willis, 
Blieszner, & Baltes, 1981). Still other investigators have concentrated on educa- 
ble mentally retarded children and adolescents; again with significant improve- 
ments (Babad. & Budoff, 1974; Budoff & Corman, 1974; Feuerstein, 1980; 
Feuerstein et al., 1987; Hamilton & Budoff, 1974; Rand, Tannenbaum, & Feuer- 
stein, 1979).3 1 ji Dir , 

These programs provide training in. widely applicable cognitive skills, problem- 
solving strategies, and efficient learning habits. Of special interest is the empha- 
sis on self-monitoring or autocriticism as a condition for effective performance 
(Flavell,.1979; Owings, Petersen; Bransford, Morris; & Stein, 1980; Whimbey, 
1975). The evaluation of one’s own performance and the recognition of what 
one understands and what one does not understand represent an important first 
step toward improving performance. All too often, the unsuccessful. learner is 


>For critical evaluations of Feuerstein's approach, see Anastasi (1980) and Blagg (1991). 
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unable to differentiate between true understanding and inaccurate or superficial 
understanding. There is evidence that children with learning disabilities are es- 
pecially deficient in autocriticism and in monitoring their own cognition (Kot- 
sonis & Patterson, 1980). 

Other examples of the type of cognitive skills taught in these intellectual de- 
velopment programs were cited in chapter 1. In that chapter, such training in 


1984; Larkin, McDermott, Simon, & Simon, 1980a; Neimark, 1987; Resnick & 
Neches, 1984; Richardson, Angle, Hasher, Logie, & Stoltus, 1996). Although the 
older person, armed with efficient learning techniques, can build up this content 
store more quickly than he or she would have as a child, it is unrealistic to expect 


MOTIVATION AND INTELLIGENCE 


Although it is customary and convenient to classify tests into Separate categories, 
it should be recognized that all such distinctions are superficial. In interpreting 


mance on an aptitude test, as well as her or his performance in school, on the job, 
or in any other context, is influenced by her or his achievement drive, persis- 
tence, value system, freedom from handicapping emotional problems, and other 
characteristics traditionally classified under the heading of "personality. " 


ing (Bloom, 1976, chap. 4; Budoff, 1987; Feuerstein et al., 1987; J- G. Nichols, 1979; 
Renninger, Hidi, & Krapp, 1992; R. E. Snow, 1989). The individual's interests, atti- 
tudes, and self-concept as a learner influence her or his openness to a learning task, 
the desire to learn it well, the attention given to the teacher, and the time devoted to 
the task. And there is evidence that these individual reactions are significantly re- 
lated to educational achievement (Baron, 1982; Dreger, 1968; J. McV. Hunt, 1981). 
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Ata more basic level, there is a growing consensus that aptitudes can no longer 
be investigated independently of affective variables (Anastasi, 1985b, 1994; Izard, 
Kagan, & Zajonc, 1989; Kanfer, Ackerman, & Cudeck, 1989, Part IV; Moore & 
Isen, 1990; Saklofske & Zeidner, 1995; Salovey & Sluyter, 1997; 
R. E. Snow, 1992; Spaulding, 1994; Sternberg & Ruzgis, 1994). 

The effects of transitory affective states on the individual’s current performance 
are well established. Even more important is the cumulative effect of personality 
traits on the direction and extent of the individual's intellectual development. Rele- 
vant data come from several types of research, including long-term longitudinal 
studies (Eichorn, Clausen, Haan, Honzik, & Mussen, 1981) and, more recently, the 
application of structural equation modeling to the analysis of causal relations 
(Shavelson & Bolus, 1982). Such studies provide evidence that the prediction of a 
person's subsequent intellectual development can be substantially improved by com- 
bining information about motivation and attitudes with scores on aptitude tests. 

One of the ways whereby motivation and other affective variables may con- 
tribute to the development of aptitudes is through the amount of time the indi- 
vidual spends on a particular activity relative to other activities competing for 
attention. On the basis of some 25 years of research on achievement motivation, 
J. W. Atkinson and his co-workers (Atkinson, 1974; Atkinson, O'Malley, & Lens, 
1976) formulated a comprehensive schema representing the interrelationships of 
abilities, motivation, and environmental variables. A key concept in this schema 
is time-on-task, namely, the time an individual devotes to one kind of activity, 
such as studying or carrying out a job-related function. Motivation influences 
both the efficiency with which a task is performed and the time spent on it rela- 
tive to other activities. Level of performance depends upon the individual's rele- 
vant abilities and the efficiency with which he or she applies those abilities to the 
current task. The final achievement or product reflects the combined influence of 
level of performance and time spent at work. : : 

Another major component of the Atkinson schema pertains to the lasting, cu- 
mulative effect of task performance on the individual's own cognitive and moti- 
vational development. This step represents a feedback loop to the individual's 
own traits. Its influence should be manifested in that person's future performance, 
both on tests and in real-life activities. The predictive value of the Atkinson 
schema was demonstrated with both computer simulations and empirical analyses 
of longitudinal data on high school students (Atkinson, 1974; Atkinson et al., 
1976; Lens, Atkinson, & Yip, 1979). 

The effect of sheer time-on-task is enhanced by attention control. What one 
attends to, how deeply attention is focused, and how long attention is sustained 
contribute to one's cognitive growth. The selectivity of attention leads to selec- 
tive learning—and this selection will differ among persons exposed to the same 
immediate situation. Such selective learning, moreover, may influence the rela- 
tive development of different aptitudes and thereby contribute to the formation 
i, 1970, 1983a, 1986b). Basically, the several as- 


of diverse trait patterns (Anastasi; : cto vel 
pects of attention control serve to intensify the effect of time devoted to relevant 
de development. 


activities and hence increase its influence on aptitu 
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The relation between personality and intellect is reciprocal. Not only do per- 
sonality characteristics affect intellectual development, but intellectual level may 
also affect personality development. Suggestive data in Support of this relation 


of personality tests that had been administered to one or more of the samples. The 
personality tests include measures of attitudes, values, motivation, and interper- 
sonal and other noncognitive traits. The results of this analysis revealed a strong 
tendency for the high-aptitude groups to undergo substantially more “psychologi- 
cally positive” personality changes than did the low-aptitude groups. 


relations; and self-concept. In the self-concept; we can see most clearly the mu- 
tual influence of aptitudes and personality traits. The child’s achievement in 
school, on the playground, and in other situations helps to shape her or his self- 
concept; and this self-concept at any given:stage influences her or his subsequent 


tions between ratings of infant behavior on personality variables and subsequent 
cognitive development assessed by. such instruments as the WISC-R and the 
Stanford-Binet (Birns & Golden, 1972; R. B. McCall, 1976; Palisin, 1986; Yarrow & 
Pedersen, 1976). In general, infants who exhibit Positive affect, active interest, and 


More specifically, studies of the environmental-mastery motive in infants have 
revealed some Promising relations to subsequent measures of intellectual compe- 
tence. The infant's environmental-mastery behavior includes the observation, ex- 
ploration, and manipulation of her or his environment. By its very nature, this 


creasing evidence in the experimental literature that it does so contribute (Hrncir, 
Speller, & West, 1985; White; 1978; Yarrow et al. 1984; Yarrow et al., 1983). In 
fact, some of the findings suggest that early indicators of a child's environmental- 
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FACTOR ANALYSIS OF INTELLIGENCE IS OR 
resol each on adeleg, 
est in the nature and composition of human intelligence.* Such res n 


with. the intercorrelations of the scores obtained by a sample of persons on a wide 
variety of ability tests. The correlation table is then subjected to further mathe- 
matical analyses, in order to identify common factors or traits among the tests. 
The various procedures currently available for this purpose are designated. by the 
generic name of factor analysis« i 2 la Or iod dpt 


Tei sedie d 


The Factor Matrix... The principal object of factor analysis is to simplify the 
description of data by reducing the number of necessary. variables, or dimensions. 
Thus, if we find that five factors are sufficient to account for all the, common 
variance in a battery of 20 tests, we can for most purposes substitute 5 scores for 
the original 20 without sacrificing essential information. The usual practice is to 
retain from among the original tests those providing the best measures of each of * 
the factors. ; 

All techniques of factor analysis begin. with a complete table of intercorrela- 
tions among a set of tests. Such a table is known as a correlation matrix. Every 
factor analysis ends with a factor matrix, that is, a table showing the weight or 
loading of each of the factors in each test. A hypothetical factor matrix involying 
only two factors is shown in Table 11-1. The factors are listed across the top and 
their weights in each of the 10 tests are given in the appropriate rows. 

Several different methods for analyzing a set of variables into common factors 
have been derived. As early as 1901, Pearson pointed the way for this type of 
analysis; and Spearman (1904, 1927) developed a precursor of modern factor 


A Hypothetical Factor Matrix | : 


Test , Factor I Factor II 


D Jicken: / 


*For some historical perspective; see Anastasi (1984b). - in 
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Kim and Mueller (1978a, 1978b) and P. Kline (1993). 

It is beyond the scope of this book to cover the mathematical basis or the com- 
putational procedures of factor analysis. An understanding of the results of factor 
analysis, however, need not be limited to those who hàve mastered its specialized 
methodology. Even without knowing how the factor loadings were computed, it is 
possible to see how a factor matrix is utilized in the identification and interpreta- 
tion of factors. For an effective reading of reports of factor-analytic research, how- 
ever, familiarity with a few other concepts and terms is helpful. 


The References Axes. It is customary to represent factors geometrically as 
references axes in terms of which each test can be plotted. Figure 11—1 illustrates 
this procedure. In this graph, each of the 10 tests from Table 11-1 has been 
plotted against the two factors, which correspond to axes I and II. Thus, the point 


Fi gure 11-1, A Hypothetical Factor Pattern, Showing 
Weights of Two Group Factors in Each of 10 Tests. 
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representing Test 1 is located by moving 24 of the distance along axis Land .54 of 
the distance along axis II. The points corresponding to the remaining 9 tests are 
plotted in the same way, using the weights given in Table 11—1. Although all the 
weights on Factor I are positive, it will be noted that on Factor II some of the 
weights are positive and some negative. This can also be seen in Figure 11-1, 
where Tests 1 to 5 cluster imone part of the graph and Tests 6 to 10 in another. 

In this connection it should be noted that the position of the reference axes is 
not fixed by the data. The original correlation table determines only the position 
of the tests (i.e., points in Fig. 11-1) in relation to each other. The same points can 
be plotted with the reference axes in any position. For this reason, factor analysts 
usually rotate axes until they obtain the most satisfactory and easily interpretable 
pattern. This is a legitimate procedure, somewhat analogous to measuring longi- 
tude from, let us say, Chicago rather than Greenwich. 

The reference axes in Figure 11-1 were rotated to position I’ and II^, shown by 
the broken lines.? This rotation was carried out in accordance with Thurstone's 
criteria of positive manifold and simple structure. The former requires the rotation of 
axes to such a position as to eliminate all significant negative weights. Most psy- 
chologists regard negative factor loadings as logically inapplicable to aptitude 
tests, since such a loading implies'that the higher the individual rates in the par- 
ticular factor, the poorer will be her or his performance on the test. The criterion 
of simple structure means essentially that each test shall have loadings on as few 
factors as possible. Both of these criteria are designed to yield factors that can be 
most readily arid unambiguously interpreted. Ifa test has a high loading on a sin- 
gle factor and no significant loading on any other, we can learn something about 
the nature of the factor by examining the content of the test. If instead the test 
has moderate to low loadings’on six factors, it can tell us little about the nature of 
any of these factors. 

It will be seen that on the rotated axes in Figure 11-1 all the verbal tests (Test 
I to 5) fall along or very close to axis I^. Similarly, the numerical tests (Test 6 to 
10) cluster closely around axis II’) The new factor loadings, measured along the 
rotated axes, are given in Table 11-2. The reader may easily verify these factor 
loadings by preparing a paper "ruler" with a scale of units corresponding to that in 
Figure 11-1. With this ruler, distances can be measured along the rotated axes. 
The factor loadings in Table 11-2 include no negative values except for very low, 
negligible amounts attributable to sampling errors. All of the verbal tests have 
high loadings on Factor I’ and practically zero loadings on Factor I’. The numeri- 
cal tests, on the other hand, have high loadings on Factor II’ and low, negligible 


loadings on Factor l’. The identification and naming of the two factors and the 


description of the factorial composition of each test have thus been simplified by 


is Ii’ should have been labeled A, to correspond to the unrotated -ll. 


"Th 
e reader may feel that rotated Ax arbritrary matter. In the present example, 


Which pole of the axis is labeled plus and which minus, however, is an ar? 
the rotated axis II’ has been “reflected” in order to eliminate negative weights. tly fr Such 

SThis criterion requires that tests have loadings on some factors that do nor differ significan dy HO P Ge 
requirement can Se be tested empirically through available statistical procedures for finding the standard error 


of factor loadings (Cudeck & O'Dell, 1994). 
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Rotated Factor Matrix 


the rotation of reference axes. In actual practice, the number of factors is often 
greater than two—a condition. that complicates the geometrical representation 


Invariance, or the corroboration of the same factors in independent but compara- 
ble investigations, is also taken into account. There is currently an increasing use 
of factor analysis in a confirmatory rather than an exploratory role. It is often 
combined with structural equation modeling (see chap; 5) in evaluating a theo- 
retically formulated model of the operation of different variables in. task perfor- 
mance (see, e.g., Loehlin, 1992). 


Interpretation of Factors. Once the rotated factor matrix has been com- 
puted, we can proceed with the interpretation and naming of factors. This step 


Factor H is verbal and Factor Il is numerical. 

The factor loadings given in Table 11—2 also represent the correlation of each 
test with the factor." It will be recalled that this correlation is the factorial validity 
of the test (chap. 5). From Table 11-2 we can say, for instance, that the factorial 
validity of the Vocabulary test as a measure of the verbal factor is .91. The factor- 
ial validity of the Addition test, in terms of the numerical factor, is 55. Obviously, 


"This is true only when an orthogonal rotation of axes is applied, With oblique rotation; to be defined in a later 
section of this chapter, factor loadings and factor correlations bear a simple relation to each other and each can 
be found from the other by appropriate computations. 
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the first five tests have negligible validity as measures of the numerical factor, and 
the last five have practically no validity as measures of the verbal factor. 


Factorial Composition of a Test. One of the basic theorems of factor 
analysis states that the total variance of a test is the sum of the variances 
contributed by the common factors (shared with. other tests) and the specific 
factors (occurring in that test alone), plus the error variance. We have already 
encountered error variance in the analysis of test scores (chap. 4). If, for instance, 
the reliability coefficient of a test is .83, we conclude that 17% of the variance of 
scores on this test is error variance (1.00 — .83 =.17). Through factor analysis, we 
can further subdivide the sources of variance contributing toward performance on 
any test. 

Let us consider two hypothetical tests listed in Table 11—3. For each test, the 
table gives its factor loading in Verbal (V), Numerical (N), and Reasoning (R) 
factors, as well as its reliability coefficient; Since factor loading also represents the 
correlation between the test and the factor, the square of the factor loading gives 
us the proportion of common variance between the test and that factor. In the 
last section of the table, each factor loading has been squared to show the propor- 
tional contribution of that factor to the total variance of test scores. Thus, we find 
that in the Arithmetic Reasoning test 16% of the.variance is attributable to the 
Verbal factor, 30% to the Numerical factor, and 36% to the Reasoning factor. The 
error variance in the last column is found by simply subtracting the reliability co- 
efficient from the total variance (1.00 —.90=.10), Whatever is left represents the 
specificity of this test, that is, that portion of its “true” variance it does not share. 
with any other test with which it was factor analyzed. For the Arithmetic Reason- 
ing test, we have: 


16 +.30+.36+ .10 2.92 
1.00 -.92 2.08 


Figure 11-2 provides a pictorial breakdown of the sources of variance for the two 
tests on Table 11-3. | 

Any individual's performance on these two tests depends on the amounts of 
each of the relevant abilities or factors he or she possesses, as well as the relative 
weights of these factors in the particular test. Thus, if we had the individual's 
score in the V, N, and R factors, expressed in the same units, we could weight 
each score by multiplying it by the corresponding factor loading: The sum of 
these products would provide an estimate of the individual's score on the test. 
The smaller the contribution of specific and error factors to the test, the more ac- 
curate this estimate would be. VEER : 

In the example given in Table 11-3, if an individual rates very high in V, this 
would help her or him much more on the Arithmetic Reasoning than on the 
Multiplication test. In fact, it would help four times as much, since the weight of 
the V factor is four times as great in Arithmetic Reasoning as in Multiplication 
(40 vs. .10). Of the three common factors, N would have the greatest effect on 
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Source of Variance of Test Scores 


Proportional Contribution 
AME TN R Specific 


Error 


Multiplication (loading = .70) and R would have the greatest effect on Arith- 
metic Reasoning (loading = .60). 


Factor Loadings and Correlation. “A second basic theorem of factor analysis 
concerns the relationship between factor loadings and the correlations among 
variables. The correlation between any two variables is equal to the sum of the 
cross-products of their common-factor loadings. Since specific and error factors 
are unique to each variable, they cannot contribute to the correlation between 
variables. The correlation between any two variables depends only on the factors 
that these two variables share. The larger the weights of these common factors in 
the two variables, the higher will be the correlation between the variables. The 


e MUT Do S BE EO Ka 


CALL NC aes, 


Figure 11-2, Percentage of Common, Specific; and Error Variance in Two 
Hypothetical Tests. gs 


(Data from Table 11-3.) 
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correlation between the two tests given in Table 11-3. can be found by 
multiplying the loadings of each of the three common factors in the two tests and 
adding the products as shown below: 


r12 (40)(.10) + (.55)(.70) + (.60)(.30) =.60 


Oblique Axes and Second-Order Factors. The axes employed in. Figure 
11-1 are known as orthogonal. axes, since. they are at right angles to each other. 
Occasionally, the test clusters are so situated that a better fit can be obtained with 
oblique axes. In. such a ease, the factors would themselves be correlated, Some 
investigators have maintained that orthogonal, or uncorrelated, factors should 
always be employed, since they provide a simpler and clearer picture of trait 
relationships. Others insist that oblique axes should be used when. they fit the 
data better, since the most meaningful categories need not be uncorrelated. An 
obvious example is that of height and weight. Although. it is well known. that 
height and weight are highly correlated, they have proved to be useful categories 
in the measurement of physique- 

When the factors are themselves correlated, it is possible to subject the inter- 
correlations among the factors to the same statistical analysis we employ with in- 
tercotrelations among tests. Jo other words, we can “factorize the factors” and 
derive second-order factors. This process has been followed in a number of studies 
with both aptitude and personality variables. Certain investigations with aptitude 
tests have yielded a single second-order general factor. As a rule, American factor 
analysts have proceeded by accounting for as much of the common variance as 
possible through group factors and then identifying a general factor as a second- 
order factor if the data justified it. British psychologists, on the other hand, have 
usually begun with a general factor, to which they attributed the major portion of 
the common variance, and then turned to group factors to account for any re- 
maining correlation. These procedural differences reflect differences in theoreti- 
cal orientation to be discussed in the following section. 


THEORIES OF TRAIT ORGANIZATION 


For more than half a century, the statistical techniques of factor analysis have 
been applied widely in an effort to understand the nature and organization of 
abilities involved in a diversity of human activities. Nevertheless, these tech- 
niques are still most closely associated with research on cognitive abilities or "in- 
telligence," where factor analysis originated. A recently completed survey of all 
available factor-analytic studies of cognitive abilities provides an impressive 
summary of such research (Carroll, 1993). Covering over 10 years of studies, this 
enterprise represents much more than a literature survey; it also includes the re- 


analysis of more than 450 data sets from the original studies. In the same source, 


different theoretical models of intelligence are described and evaluated within 


their historical framework. In the present section, we shall consider only a few 
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widely known theories of intelligence, selected because of their impact on the 
Construction and use of tests. 


The Two-Factor Theory. The frst theory of trait organization based on a 
statistical analysis of test scores was the two-factor theory developed by the 
British psychologist Charles Spearman (1904, 1927). In its original formulation, 
this theory maintained that all intellectual activities share a single common 
factor, called the general factor, or g. In addition, the theory postulated numerous 
specifics, or s.factors, each being strictly specific to à single activity. Positive 
correlation between any two functions was thus attributed to the g factor. The 
more highly the two functions were "saturated" with g, the higher the correlation 
between them would be. The presence of specifics, on the other hand, tended to 


Although two types of factors, general and specific, are posited by this theory, it 
is only the single factor g that accounts for correlation. In contrast to other theo- 
ries of trait relations, therefore, it could be more precisely characterized às a single- 
factor theory, although the original designation has persisted. Figure 11-3 
illustrates the basis for correlation among tests according to this theory. In this il- 
lustration, tests 1 and 2 would correlate highly with each other since each is 
highly saturated with g, as shown by the shaded areas. The white area in each test 
represents specific and error variance. Test 3 would have low: correlations with 
each of the other two tests, since it contains very little g 


MPO CIE 


Figure 11-3, Correlation Model Underlying Two-Factor 
Theory. 
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It follows from the two-factor theory that the aim of psychological testing 
should be to measure the amount of each individual’s g. If this factor runs through 
all abilities, it furnishes the only basis for prediction of the individual's perfor- 
mance from one situation to another. It would be futile to measure specific fac- 
tors, since each by definition operates in only a single activity. Accordingly, 
Spearman proposed that a single test, highly saturated with g, be substituted for 
the heterogeneous collection of items found in intelligence tests. He suggested 
that tests dealing with abstract relations are probably the best measures of g and 
could be used for this purpose. Examples of tests constructed as measures of g in- 
clude Raven’s Progressive Matrices and Cattell’s Culture Fair Intelligence Test. 

From the outset, Spearman realized that the two-factor theory must be qualified. 
When the activities compared are similar, a certain degree of correlation may result 
over and above that attributable to the g factor. Thus, in addition to general and 
specific factors, there might be another, intermediate class of factors, not so univer- 
sal as g nor so strictly specific as the s factors. Such a factor, common to a group of 
activities but not to all, was designated a group factor. In the early formulation of his 
theory, Spearman admitted the possibility of very narrow and negligibly small group 
factors. Following later investigations by several of his students, he included much 
broader group factors such as arithmetic, mechanical, and linguistic abilities. 


Multiple-Factor Theories. The prevalent American view of trait organiza- 
tion, based on early factor-analytic research, recognized a number of moderately 
broad group factors, each of which could enter with different weights into dif- 
ferent tests. For example, a verbal factor could have a large weight in a vocabulary 
test, a smaller weight in a verbal analogies test, and a still smaller weight in. an 
arithmetic reasoning test. Figure 11—4 illustrates the intercorrelations among five 
tests in terms of a multiple-factor model. The correlations of Tests 1, 2, and 3 with 
each other result from their common loadings with the verbal factor (V). 
Similarly, the correlation between Tests 3 and 5 results from the Spatial factor 
(S), and that between Tests 4 and 5 from the Number factor (N). Tests 3 and 5 
are factorially complex, each having appreciable loadings in more than one 
factor: V and S in Test 3, N and S in Test 5. From the second basic theorem of 
factor analysis, discussed in the preceding section, we can also tell something 
about the relative magnitude of the intercorrelations. For example, Test 3 will 
correlate higher with Test 5 than with Test 2 because the weights of the S factor 
$ Tests 3 and 5 (diagonally striped € are cen than the weights of the V 
actor in Tests 2 and 3 (horizontally striped areas). 

The publication of T. L. Kelley's Crossroads in the Mind of Man (1928) paved 
the way for a large number of studies in quest of particular group factors. Chief 
among the factors proposed by Kelley were manipulation of spatial relationships, 
facility with numbers, facility with verbal material, memory, and speed. This list 
has been modified and extended by subsequent investigators employing the more 
modern methods of factor analysis discussed in the preceding section. r 

One of the leading exponents of multiple-factor theory was Thurstone: On the 
basis of extensive research by himself and his students, Thurstone proposed about 
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Figure 11-4. Correlation Model Underlying Multiple-Factor Theories. 


a dozen group factors which he designated "primary mental abilities." Those most 
frequently corroborated in the work of Thurstone and of other independent in- 
vestigators (French, 1951; Harman, 1975; Thurstone, 1938; Thurstone & Thur- 
stone, 1941) include the following: 


V. 


Verbal Comprehension: The principal factor in such tests as reading compre- 
hension, verbal analogies, disarranged sentences, verbal reasoning, and 
proverb matching. It is most adequately measured by vocabulary tests. 

Word Fluency: Found in such tests as anagrams, rhyming, or naming words in 
a given category (e.g., boys’ names, words beginning with the letter T). 
Number: Most closely identified with speed and accuracy of simple arith- 
metic computations. 

Space: This factor may represent two distinct factors, one covering percep- 
tion of fixed spatial or geometric relations, the other manipulatory visualiza- 


tions, in which changed positions or transformations must be visualized 
(McGee, 1979; Portegal, 1982). 


Other restricted memory factors, such as memory for temporal sequences 
and for spatial position, have been suggested by some investigations. 
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P. Perceptual Speed: Quick and accurate grasping of visual details, similarities, 
and differences. This factor may be the same as the speed factor identified by 
Kelley and other early investigators. This is one of several factors subse- 
quently identified in perceptual tasks (Thurstone, 1944). 

1.. (or R). Induction (or General Reasoning): The identification of this factor was 
least clear. Thurstone originally proposed an inductive and a deductive fac- 
tor. The latter was best measured by tests of syllogistic reasoning and the for- 
mer by tests requiring the respondent to find a rule, as in a number series 
completion test. Evidence for the deductive factor, however, was much 
weaker than for the inductive. Moreover, other investigators suggested a 
general reasoning factor, best measured by arithmetic reasoning tests. 


It should be noted that the distinction between general, group, and specific 
factors is not so basic as may at first appear. If the number or variety of tests in a 
battery is small, a single general factor may account for all the correlations among 
them. But when the same tests are included in a larger battery with a more het- 
erogeneous collection of tests, the original general factor may emerge as a group 
factor, common to some but not all tests. Similarly, a certain factor may be repre- 
sented by only one test in the original battery, but may be shared by several tests 
in the larger battery. Such a factor would have been identified as a specific in the 
original battery, but would become a group factor in the more comprehensive bat- 
tery. In the light of these considerations, it is not surprising to find that intensive 
factorial investigations of special areas have yielded many factors in place of the 
one or two primary mental abilities originally identified in each area. Such has 
been the case in studies of verbal, perceptual, memory, and reasoning tests. 

Factorial research seems to have produced a bewildering multiplication of fac- 
tors. The number of cognitive factors reported to date by different investigators is 
considerably over 100. A certain amount of order was achieved by cross-identifi- 
cation of factors reported by different investigators and often given different 
names (Ekstrom, French, & Harman, 1979; French, 1951; Harman, 1975). Such 
cross-identification can be accomplished when there are several tests common to 
the investigations being compared. To facilitate this process, a group of factor an- 
alysts assembled a kit of “reference tests” measuring the principal aptitude factors 
identified at that time, This kit, distributed by Educational Testing Service (Ek- 
strom, French, Harman, & Dermen, 1976; ETS kit, 1976), made it easier for dif- 


ferent investigators planning factorial research to include some common tests In 


their batteries. vs CN 
fter these efforts at simplification and coordination, 


It is apparent that even al E d 
the number of factors remains large. Human behavior is varied and complex, an 
de an adequate description 


it is unrealistic to expect a dozen or so factors to provi i 
of it. For specific purposes, however, we can choose appropriate factors with re- 
gard to both nature and breadth. For example, if we were selecting applicants for a 
difficult and highly specialized mechanical job, we would probably want to mea- 
sure fairly narrow perceptual and spatial factors that closely match the job re- 
quirements. In selecting college students, on the other hand, a few broad factors 
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such as verbal comprehension, numerical facility, and general reasoning would be 
most relevant. 


Structure-of-Intellect Model, Some factor analysts have tried to simplify the 
picture of trait relationships by organizing the traits into a systematic schema. On the 
basis of more than two decades of factoranalytic research, Guilford (1967, 1988; 
Guilford & Hoepfner, 1971) proposed a boxlike model, which he called the 
structure-of-intellect (SI) model. Illustrated in Figure 11-5, this model? classifies 
intellectual traits along three dimensions: 


Operations—what the respondent does. These include cognition, memory 
recording, memory retention, divergent production (prominent in creative 
activity), convergent production, and evaluation. f 

Contents—the nature of the materials or information on which operations are 
performed. These include visual, auditory, symbolic (e.g., letters, numbers), 
semantic (e.g., words), and behavioral (information about other persons’ be- 
havior, attitudes, needs, etc.). 

Products—the form in which information is processed by the respondent. Prod- 
ucts are classified into units, classes, relations, systems, transformations, and 
implications. 


models of trait organization, provides one schema for representing the obtained 
correlations among variables. Because of the method employed in the rotation of 
axes, the empirical corroboration of the SI model does not preclude other models 
(Carroll, 1972; Horn & Knapp, 1973). Different rotations of the same data could 
be found to fit other models equally closely. Re-analyses of Guilford's original 
factor-analytic data have, in fact, found that other models fitted the data better 
than the SI model and were also more consistent with theoretical and practical 
interpretations (Bachelor, 1989; Carroll, 1993 = 


ERAN ée 

a a SC g ERUR a research, this model differs from the earlier model (Guilford, 1967) by the separa- 
ton ot the figural content facet into visual and auditi int 

recording and retention (Guilford, 1988), DE e Edge 


°For relevant references, see Carroll (1993), especially pp. 57.60. 
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Figure 11-5. Three-Dimensional Model of the Structure of Intellect. 
(As revised by Guilford, 1988, p. 3. Copyright € 1988 by Educational and Psychological Measurement. Repro- 


duced by permission.) 


On the other hand, a favorable indirect effect of the SI project is the focusing of 


attention on the distinction between operation and content in the identification 
both the factors identified through 


of factors. This distinction has helped to clarify 
factor analysis and the processes investigated through cognitive psychology, as 
well as the relation between the two. More will be said about this point in the 
final section of this chapter, dealing with the nature and development of traits. 


Another indirect effect of the SI model is its differentiation between convergent 
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and divergent thinking. The latter concept, with its implication of atypical be- 
havior, has been widely adopted in analyses of creativity. Efforts to develop tests of 
divergent thinking, independent of the content domain, however, have generally 
proved unsuccessful. Divergent thinking and creative productivity have proved to 
be specific to the field or subject-matter to which they are applied, such as a par- 
ticular science (e.g., biology or physics) or a particular art form (e.g., music or 
sculpture). Moreover, a creative idea or product must be meaningful or useful 
within its particular culture; mere divergence, without qualitative advancement, 
does not constitute creativity. Research on the broad topic of creativity itself, in 
whatever context and form manifested, has been Progressing apace. The growing 
interest in this field is illustrated by the series of volumes on creativity research 
published in the 1990s, under the editorship of R. S. Albert ( 1991-1994). The in- 
dividual books included in this series provide a wide coverage of the topic, from 
the varied manifestations of creativity to ways of fostering and developing creative 
behavior. 


Hierarchical Theories. An alternative schema for the organization of factors 
was proposed by a number of British psychologists, including Burt (1949) and 
Vernon (1960), and by Humphreys (1962) in the United States. A diagram 
illustrating Vernon's application of this System is reproduced in Figure 11—6. At 
the top of the hierarchy, Vernon placed Spearman’s g factor. At the next level are 
two broad group factors, corresponding to verbal-educational (v:ed) and to 
practical-mechanical (k:m) aptitudes. These major factors may be further 
subdivided. The verbal-educational factor, for example, yields verbal and 
numerical subfactors, among others. Similarly, the practical-mechanical factor 
splits into such subfactors as mechanical-information, Spatial, and psychomotor 


Figure 11-6. Modelofa Hierarchical Organization of Abilities. 
(Adapted from P. E. Vernon, 1960, p. 22. Copyright O 1960, Methuen & Co., Ltd. Reproduced by permission.) 
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abilities. Still narrower subfactors can be identified by further analysis, let us say, 
of the verbal tasks. At the lowest level of the hierarchy are the specific factors. In 
a later elaboration of the model, Vernon (1969) included certain more complex 
interrelations and cross-contributions of factors at the third level, especially in 
connection with educational and vocational achievement. For example, scientific 
and technical abilities are linked to both spatial abilities and mechanical 
information; mathematical abilities are linked to both spatial and number 
abilities, as well as more directly to the g factor through an induction factor. 

Humphreys (1962, 1970) also recommended a hierarchical model as a means 
of coping with the proliferation of factors. Rather than considering any single 
level of factors as primary, however, he suggested that each test constructor or 
user choose that level of the hierarchy that is most appropriate for her or his pur- 
poses. Humphreys recognized, moreover, that a single test may be classified into 
more than one hierarchy, with reference to content, process, and other facets. To 
measure any one facet, he proposed that the test be made heterogeneous with re- 
gard to all other facets. For example, if we are interested in the person's ability to 
solve analogies problems, we should use a test that includes verbal, numerical, 
pictorial, and spatial analogies. If we wish to measure verbal ability, we should do 
so with a variety of item types, such as vocabulary, analogies, and series comple- 
tion. This procedure contrasts with that employed by Guilford, who sought sepa- 
rate factors (and tests) for each homogeneous cell in his three-way classification. 
In his later work, however, Guilford (1981) followed a partial hierarchical organi- 
zation in identifying higher-order factors among some of the factors in his original 
structure-of-intellect model. 

The hierarchical model of intelligence is being accepted more and more widely 
for both theoretical and practical purposes (see Anastasi, 1992a, 1994; Carroll, 
1993; Gustafsson, 1984, 1989; Lubinski & Dawis, 1992). As a theoretical model 
of trait relations, it reconciles the single general factor (Spearman's g) with multi- 
ple-factor patterns. Methodologically, it has been shown that the multiple-factor 
and the hierarchical solutions are mathematically equivalent and transposable 
from one to the other (Harman, 1976, chap. 15; Schmid & Leiman, 1951). The 
oblique-axis solution (with correlated factors) that leads to the hierarchical 
model can be transposed into an orthogonal-axis solution (with uncorrelated fac- 
tors). In the latter solution, the second-order factors emerge as first-order factors 
of different breadth. The broader factors have loadings on more variables than do 
the narrower factors. : 

From a practical standpoint, a major advantage of tests developed from a hier- 
archical pattern is that such tests combine comprehensive aptitude coverage with 
flexibility of use. To fit different testing purposes, the test user can choose a single 
overall battery score, or one or more scores on. clusters of tests measuring more 
narrowly defined factors. Under certain circumstances, scores on individual sub- 
tests may be useful in identifying either weakness or strength in specialized skills. 

The increasing impact of the hierarchical model on the corem and use 
of ability tests was cited in chapter 2, as a historical development that unified tra- 
ditional intelligence tests. with multiple aptitude batteries. Examples of such 
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hierarchical approach were given in chapter 8 for individual tests and in chapter 
10 for group tests. Among the clearest examples of the combination of compre- 
hensive coverage and flexible use are the Differential Ability Scales (Elliott, 
1990b), for individual testing, and the Multidimensional Aptitude Battery (Jack- 
son, 1994b), for group testing. A noteworthy feature of the effective flexibility of 
scoring levels in such tests is the provision of norms for the interpretation of scores 
at the chosen level. Without such separate-level norms, à proper evaluation of the 
test taker’s performance at different levels of the hierarchy would be unavailable. 


NATURE AND DEVELOPMENT OF TRAITS 


That different investigators may arrive at dissimilar models of trait organization 
becomes less perplexing when we recognize that the traits identified through fac- 
tor analysis are simply an expression of correlation among behavior measures. 
They are not underlying entities or casual factors, but descriptive categories. 
Hence, it is conceivable that different principles of classification may be applica- 
ble to the same data. The concept of factors as descriptive categories was explicit 
in the early writings of Thomson (1948), Burt (1941, 1944), and Vernon (1960) 
in England, and those of R. C. Tryon (1935) in America. All these writers called 
attention to the vast multiplicity of behavioral elements, which may become or- 
ganized into clusters through either hereditary or experiential linkages. 


Experiential History. There has been an increasing recognition of the role of 
the individual’s experiential history in the development of intelligence and the 
formation of group factors (Anastasi, 1986b; Greeno, 1989). It is not only the 
level of performance in different abilities but also the way in which performance 
is organized into distinct traits that is influenced by experiential background. 
Differences in factor patterns have been found to be associated with different 
cultures or subcultures, socioeconomic levels, and types of school curricula (see 
Anastasi, 1970, 1983a, 1986b, 1994; Vernon, 1969). Changes in factor patterns 
over time are also relevant. These include long-term changes—which may reflect 
the cumulative effect of everyday life experiences—as well as short-term changes 
resulting from practice and other experimentally controlled learning experiences 
(Baltes, Cornelius, Spiro, Nesselroade, & Willis, 1980; Birren, Cunningham, & 
Yamamoto, 1983; Fleishman, 1972; Fleishman & Mumford, 1989; Khan, 1970, 
1972; Reinert, 1970). Research on animals has also yielded suggestive evidence 
regarding the experimental production of factors by the control of early 
experiences (Whimbey & Denenberg, 1966). 

The factorial composition of the same objective task may differ among individ- 
uals with diverse experiential backgrounds. One reason for these individual differ- 
ences may be found in the use of different methods to carry out the same task. 
Individuals with highly developed verbal abilities, for example, will tend to utilize 
verbal mediators to solve a mechanical or spatial problem; those whose experi- 
ences have been predominantly mechanical, on the other hand, will tend to fol- 
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low a perceptual or spatial approach in solving the same problem. Relevant evi- 
dence was provided by French (1965), who found that the factorial composition 
of the same tests differed between groups of persons classified according to their 
typical problem-solving styles. Suggestive evidence can also be found in a study 
by C. H. Frederiksen (1969) on the cognitive strategies subjects use in memoriz- 
ing words. In the course of learning, individuals may change their choice of strat- 
egy and thereby alter the factorial composition of the task as performed by them. 
Similar shifts in aptitude requirements have been shown to occur over longer pe- 
tiods of time in the course of instruction (R. B. Burns, 1980). 


Mechanisms of Trait Formation. A mechanism for the emergence of factors 
is provided by the familiar concepts of learning set and transfer of training 
(Carroll, 1966; G. A. Ferguson, 1954, 1956; Simon, 1990; Whiteman, 1964). The 
establishment of learning sets enables the individual to learn more efficiently 
when presented with a new problem of the same kind. In Harlow's (1949, 1960) 
classic experiments with monkeys, after the animal had solved problems requiring 
the differentiation between certain shapes (like triangle and circle), it learned to 
discriminate between other shapes much more rapidly than did animals without 
such experience. The animal had established a learning set for differentiating 
shapes; it knew what to look for when faced with a new problem. The animal had 
thus "learned how to learn" this type of problem. 

Similarly, many of the skills developed through formal schooling, as in reading 
and arithmetic computation, are applicable to a wide variety of subsequent learn- 
ing situations. Efficient and systematic problem-solving techniques can likewise 
be applied to the solution of new problems. Individual differences in the extent to 
which these skills have been acquired will be reflected in the performance of a 
large number of different tasks; and in a factor analysis of these tasks, these widely 
applicable skills would emerge as broad group factors. The breadth of the transfer 
effect, or the variety of tasks to which a skill is applicable, would thus determine 
the breadth of the resulting group factor. 

Another major source of trait formation is the contiguity or co-occurrence of 
learning experiences. For example, persons in educationally advanced cultures are 
likely to develop a broad verbal-educational factor running though all activities 
learned in school. A. narrower factor of numerical aptitude may result from the 
fact that all arithmetic processes are taught together by the same teacher in the 
same classroom. Hence, the child who is discouraged, antagonized, or bored dur- 
ing the arithmetic period will tend to fall behind in learning all these processes; 
the one who is stimulated and gratified in the arithmetic class will tend to learn 
well all that is taught in that class period and to develop attitudes that will ad- 
vance his or her subsequent numerical learning. f 

However developed, the factors or abilities identified through factor analysis 
are descriptive categories, reflecting the changing interrelationships of perfor- 
mance in a variety of situations. These factors are not static entities but are them- 
selves the product of the individual's cumulative experiential history. Insofar as 
the interrelationships of experiences vary among individuals and groups, different 
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factor patterns may be expected among them. As the individual's experiences 
change—through formal education, occupational functions, and other continu- 
ing activities—new traits may become differentiated or previously existing traits 
may merge into broader composites. 


Factor Analysis and Cognitive Task Analysis. The more recently de- 
veloped information-processing procedures of Cognitive psychology are making 
significant contributions, not only to an understanding of what intelligence tests 
measure (see chap. 5) but also to an understanding of the formation and 
development of factors. Protocol analysis, whereby the individual is directed to 
“think aloud” as he or she solves a problem or performs an intellectual task, 
provides a promising approach to the analysis of human thinking (Ericsson & 
Simon, 1993). As research on cognitive processes continued, however, there was 
increasing evidence of the domain specificity of thinking processes. Except at the 
most elementary levels, processing skills tend to be specific to the type of content 
processed; and they are acquired as the content knowledge is mastered and 
organized for ready retrieval. 

On the other hand, the traits most commonly identified through factor analysis 
deal chiefly with content domains, such as verbal, numerical, and spatial. Those 
less common factors that are defined by processes, such as associative memory, 
general reasoning, or divergent thinking, are themselves found to be domain- 
specific when assessed through specially developed tests. Whether a test is derived 
from factor analysis or from cognitive analysis, the fundamental distinction is that 


specificity of all cognitive Processes, is reflected in the impressive growth of publi- 
cations that highlight domain specificity.!? It is also noteworthy that tests devel- 
oped to assess the content-defined traits derived by factor analysis have proved to 
be good predictors of everyday life performance. It is likely that such factors may 


may excel not only in the extent and organization of verbal knowledge but also 
on the use of specialized processing skills required in the verbal domain. 


of intelligence, wherever manifested. When considered from this broader 
viewpoint, human intelligence comprises that combination of cognitive skills and 
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knowledge demanded, fostered, and rewarded by the experiential context within 
which the individual functions (Anastasi, 1986c). In this generalized construct of 
intelligence, domain specificity is even more basic than it is in the narrower 
processes identified by protocol analysis or factor analysis. Many aspects of 
intelligence develop while the individual is learning factual knowledge and 
processing skills within a single domain, such as a particular cultural context or a 
specific occupation. : 

Early efforts to identify a single, universally applicable intelligence produced 
the traditional "intelligence tests." Soon such tests came to be known as measures 
of academic intelligence or scholastic aptitude, which was the particular intelli- 
gence they were actually assessing. The term “practical intelligence” was then 
proposed as a different kind of intelligence, not covered by traditional tests (see 
Anastasi, 1986b; Neisser, 1976; Sternberg & Wagner, 1986). Such practical intel- 
ligence, however, is not a single intelligence but rather a multiplicity of intelli- 
gences, applicable in different practical domains (Lave, 1988; Rogoff & Lave, 
1984). There are not two types of intelligence, academic and practical; intelli- 
gence is a multifaceted construct. 

A weakness of traditional factor analysis arises from the inadequate attention 
given to the choice of variables to be factor analyzed (Anastasi, 1988a). Obvi- 
ously, the factors identified by such an analysis are derived from the intercorrela- 
tions among the chosen variables. Most investigations of intelligence began with 
a set of scores on tests designed to measure school-taught performance. Hence the 
factors that emerged represented academic intelligence. In. educationally ad- 
vanced cultures, such factors prove to be good predictors of school achievement; 
they also yield moderately high correlations with performance on many jobs for 
which schooling itself is relevant. If we want to assess intelligence in broader con- 
texts, however, we need to start with people's performance in real-life, culturally 
valued activities. Although students in school are readily accessible participants 
in testing research, it is becoming evident that adults in various occupations are 
available. Research can be so organized as to benefit the participants and also 
yield well-controlled scientific data. There are increasing examples of such col- 
laborative research effectively conducted in industrial and farming contexts (e.g., 
Fleishman & Reilly, 1992b; Lubinski & Dawis, 1992; Whyte, 1991). 

Some of the procedures developed in cognitive psychology have themselves 
highlighted the domain specificity of intelligence. One approach to the investiga- 
tion of cognitive processes is through the protocol analysis of contrasted groups, 
such as experts and novices in a particular activity (eg. chess players, typists, com- 
puter programmers). It has been repeatedly established that the individual's perfor- 
mance level and employment of cognitive skills is strictly specific to the particular 
domain. A special advantage of this procedure of cognitive psychology is that the 


research has been conducted on a wide variety of activities from real-life contexts. 


Such activities range from waiters’ memory for restaurant orders to medical diagno- 


sis and judicial court decisions (Chi et al., 1988; Ericsson & Smith, 1991). 3 
The concept of domain specificity has been having an increasing effect on the 
psychological study of several topics. For example, it is clearly reflected in the cur- 
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tent definitions of genius and giftedness. The identification of gifted children has 
shifted from a specified IQ level (Terman et al., 1925) to superior talent in any of 
a multitude of socially desirable fields (Csikszentmihalyi, Rathunde, & Whalen, 
1993; Feldman & Bratton, 1972; Horowitz & O'Brien, 1985; Subotnik & Arnold, 
1994). Similarly, early efforts to design general process tests of creativity or diver- 
gent thinking have clearly encountered domain specificity (Baer, 1993; Runco, 
1991, 1994; Subotnik & Arnold, 1994). Some progress has now been made in 
integrating domain in tests of creativity. It appears that the union of cognitive 
psychology with psychometric factor analysis has not only enriched the understand- 
ing of intellectual functioning but has also brought both types of research closer to 
the world of daily living. 


Psychological Issues 
in Ability Testing 


n inevitable consequence of the expansion and growing complexity of any 
scientific endeavor fe ap increasing specialization of interests and functions 
among its practitioners. Such specialization is clearly apparent in the rela- 
tionship of psychological testing to the mainstream of contemporary psychol- 
ogy (Anastasi, 1967, 1991). Specialists in psychometrics have raised techniques of 
test construction to truly impressive pinnacles of quality. While providing techni- 
cally superior instruments, however, they have given relatively little attention to 
ensuring that test users had the psychological information needed for the proper 
use of such instruments. As a result, outdated interpretations of test performance 
all too often survived without reference to the results of pertinent behavioral re- 
search. This partial isolation of psychological testing from other fields of psychol- 
ogy—with its consequent misuses and misinterpretations of tests—accounts in 
part for the public discontent with psychological testing that began in the 1950s, 
Wer 'to-a strident ierescendo in the 1970s, and lingers in various contexts today. 
The topics chosen for discussion in this chapter illustrate ways in which the find- 
ings of psychological research can contribute to the effective use of ability tests 
and help to correct popular misconceptions about “the 1Q” and similar scores. 


LONGITUDINAL STUDIES OF CHILDREN’S INTELLIGENCE 


An important approach to the understanding of the construct “intelligence” is 
through longitudinal studies of the same individuals over long periods of time. 
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Although such investigations may be regarded as contributing to the long-term 
predictive validation of specific tests, they have broader implications for the na- 
ture of intelligence and the meaning of intelligence test scores. When intelli- 
gence was believed to be largely an expression of hereditary potential, each 
individual's “IQ” was expected to remain very nearly constant throughout life. 
Any observed variation on retesting was attributed to weaknesses in the measur- 
ing instrument—either inadequate reliability or poor selection of functions 
tested. With increasing research on the nature of intelligence, however, has come 
the realization that intelligence itself is both complex and dynamic. In the fol- 
lowing sections, we shall examine typical findings of longitudinal studies of intel- 
ligence and shall inquire into the conditions making for both stability and 
instability of tested abilities. 


Stability of Intelligence Test Performance. An extensive body of data 
has accumulated showing that, over the elementary, high school, and college 
period, intelligence test performance is quite stable (see Anastasi, 1958, pp. 
232-238; Bornstein & Krasnegor, 1989; McCall, Appelbaum, & Hogarty, 1973). 
In an early Swedish study of a relatively unselected population, for example, 
Husén (1951) found a correlation of -72 between the test scores of 613 third- 
grade schoolboys and the scores obtained by the same persons 10 years later, on 
their induction into military service. In a later Swedish study, Harnqvist (1968) 
reported a correlation of .78 between tests administered at 13 and 18 years of age 
to over 4,500 young men. Even preschool tests show remarkably high correlations 
with later retests. In a longitudinal study of 140 children conducted at Fels 
Research Institute (Sontag, Baker, & Nelson, 1958), Stanford-Binet scores 
obtained at 3 and at 4 years of age correlated .83. The correlation with the 3-year 
tests decreased as the interval between retests increased, but by age 12 it was still 
as high as .46, Of special relevance to the Stanford-Binet. is the follow-up 
conducted by Bradway, Thompson, and Cravens (1958) on children originally 
tested between the ages of 2 and 5/4 as part of the 1937 Stanford-Binet 
standardization sample. Initial IQs correlated .65 with 10-year retests and .59 
with 25-year retests. The correlation between the 10-year retest (mean age = 14 
years) and 25-year retest (mean age = 29 years) was .85. 

As would be expected, retest correlations are higher, the shorter the interval 
between tests. With a constant interval between tests, moreover, retest correla- 
tions tend to be higher the older the children. The effects of age and retest inter- 
val on retest correlations exhibit considerable regularity and are themselves 
highly predictable (R. L. Thorndike, 1933, 1940). One explanation for the in- 
creasing stability of Intelligence test scores with age is provided by the cumulative 
nature of intellectual development. The individual's intellectual skills and knowl- 
edge at each age include all her or his earlier skills and knowledge plus an incre- 
ment of new acquisitions. Even if the annual increments bear no relation to each 
other, a growing consistency of performance level would emerge, simply because 
earlier acquisitions constitute an increasing proportion of total skills and knowl- 
edge as age increases. 
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Although such overlapping of skills and knowledge at successive ages may ac- 
count for some of the increasing stability of test scores in the developing individ- 
ual, two additional conditions merit consideration. The first is the environmental 
stability characterizing the developmental years of the most persons. Hence, what- 
ever advantages or disadvantages they had at one stage of their development tend 
to persist in the interval between retests. Children tend to remain in the same 
family, the same socioeconomic level, and the same cultural milieu as they grow 
up. They are not typically shifted at random from intellectually stimulating to in- 
tellectually retarding environments. 

At the same time, it should be noted that the psychological environments of 
siblings in the same family are far from identical. There is a growing amount of 
carefully gathered data showing the various ways whereby the experiential history 
of siblings reared together may differ (Boer & Dunn, 1992; Dunn & Plomin, 
1990; Hetherington, Reiss, & Plomin, 1993). Having a younger or an older sib- 
ling is itself a significantly different psychological experience from that of being 
an only child. Nature of parental child-rearing practices, as well as parents’ own 
response to the child’s behavior, may also vary for siblings born at different times. 
Moreover, major events affecting family life—such as divorce, large shifts in fam- 
ily income, or moving from a rural to an urban environment may differently af- 
fect children who are themselves at different developmental stages. To all these 
potential differences must be added the cumulative experiences that different sib- 
lings may have outside the family: Hence, although familial continuity may con- 
tribute to stability of an individual child's test performance, it does not necessarily 
lead to sibling resemblance. 

A second condition contributing to the general stability of the individual's in- 
telligence test performance pertains to the role of prerequisite learning skills on sub- 
sequent learning. Not only does the individual retain prior learning, but much of 
her or his prior learning provides tools for subsequent learning. Hence, the more 
progress a child has made in the acquisition of intellectual skills and knowledge at 
any one point in time, the better able he or she is to profit from subsequent learn- 
ing experiences. The concept of readiness in education is an expression of this 
general principle. The sequential nature of learning is also implied in the previ- 
ously discussed Piagetian approach to mental development, as well as in various 
individualized instructional programs. 

Applications of the same principle underlie Project Head Start and other com- 
pensatory programs for educationally disadvantaged preschool children (Stanley, 
1972, 1973; Zigler & Valentine, 1980). Insofar as children from disadvantaged 
backgrounds lack some of the essential prerequisites for effective school learning, 
they would only fall farther and farther behind in academic achievement as they 
progressed through the school grades. It should be added that learning prerequi- 
sites cover not only such intellectual skills as the acquisition of language and of 
quantitative concepts, but also attitudes, interests, motivation, problem-solving 
styles, reactions to frustration, self-concepts, and other personality characteristics. 
The object of compensatory educational programs is to provide the learning pre- 
requisites that will enable children to profit from subsequent schooling: In so 
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doing, of course, these programs hope to disrupt the "stability" of IQs that would 
otherwise have remained low. 


Instability of Intelligence Test Performance. Correlation studies of test 
scores provide actuarial data, applicable to group predictions. For the reasons 
given above, scores tend to be quite stable in this actuarial sense. Studies of 
individuals, on the other hand, may reveal large upward or downward shifts in 
test scores. Sharp rises or drops may occur as a result of major environmental 
changes in the child’s life. Drastic changes in family structure or home conditions, 
adoption into a foster home, severe or prolonged illness, and therapeutic or 
temedial programs are examples of the type of events that may alter the child’s 
subsequent intellectual development. Even children who remain in the same 
environment, however, may show large increases or decreases in test scores on 
retesting. These changes mean, of course, that the child is developing at a faster 
or a slower rate than that of the normative population on which the test was 
standardized. In general, children in educationally disadvantaged environments 
tend to lose and those in favorable environments to gain in score with age. 
Investigations of the specific characteristics of these environments and of the 
children themselves are receiving increasing attention in current research (see, 
e.g., Carroll, 1993, pp. 669-674; Detterman & Sternberg, 1982). 

Extensive data on the magnitude of individual score changes in intelligence test 
performance were first provided by the California Guidance Study (Honzik, Mac- 
farlane, & Allen, 1948). An analysis of retest data on 222 cases from this study 
found individual IQ changes of as much as 50 points. Over the period from 6 to 18 
years, when retest correlations are generally high, 5996 of the children changed by 
15 or more IQ points, 37% by 20 or more points, and 9% by 30 or more. Nor were 
most of these changes random or erratic in nature. On the contrary, children exhib- 
ited consistent upward or downward trends over several consecutive years; and 
these changes were related to environmental characteristics. In the California 
Guidance Study, detailed investigation of home conditions and parent-child rela- 
tionships indicated that large upward or downward shifts in IQ were associated with 
the cultural milieu and emotional climate in which the child was reared. A further 
follow-up conducted when the Participants had reached the age of 30 still found 
significant correlations between test scores and family milieu as assessed at the age 
of 21 months (Honzik, 1967). Parental concern with the child’s educational 
achievement emerged as an important correlate of subsequent test performance, as 
did other variables reflecting parental concern with the child's general welfare. 

Some investigators concentrated more specifically on the personality charac- 
teristics associated with intellectual acceleration and deceleration. At the Fels 
Research Institute, 140 children were included in an intensive longitudinal study 
extending from early infancy to adolescence and beyond (Kagan & Freeman, 
1963; Kagan, Sontag, Baker, & Nelson, 1958; Sontag et al., 1958). Within this 
group, those children Showing the largest gains and those showing the largest 
losses in IQ between the ages of 4/4 and 6 were compared in a wide variety of per- 
sonality and environmental measures; the same was done with those showing the 
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largest IQ changes between 6 and 10. During the preschool years, emotional de- 
pendency on parents Was the principal condition associated with IQ loss. During 
the school years, IQ gains were associated chiefly with high achievement drive, 
competitive striving, and curiosity about nature. Suggestive data were likewise 
obtained regarding the role of parental attitudes and child-rearing practices in the 
development of these traits. 

A later analysis of the same sample, extending through age 17, focused princi- 
pally on patterns of score change over time (McCall et al., 1973). Children ex- 
hibiting different patterns were compared with regard to child-rearing practices as 
assessed through periodic home visits. A typical finding was that the parents of 
children whose scores showed a rising trend during the preschool years presented 
“an encouraging and rewarding atmosphere, but one with some structure and en- 
forcement of policies" (McCall et al., 1973, p. 54). A major condition associated 
with rising scores is described as accelerational attempt, or the extent to which 
"the parent deliberately trained the child in various mental and motor skills 
which were not yet essential" (p. 52). 

Research on the factors associated with increases and decreases in intelligence 
test scores throws light on the conditions determining intellectual development 
in general. It also suggests that prediction of subsequent intellectual status can be 
improved if measures of the individual's emotional and motivational characteris- 
tics and of her or his environment are combined with initial test scores. From still 
another viewpoint, the findings of this type of research point the way to the kind 
of intervention programs that can effectively alter the course of intellectual de- 
velopment in the desired directions. 


INTELLIGENCE IN EARLY CHILDHOOD 


The assessment of intelligence at the two extremes of the age range presents spe- 
cial theoretical and interpretive problems. One of these problems pertains to the 
functions that should be tested. What constitutes intelligence for the infant and 
the preschool child? What constitutes intelligence for the older adult? The sec- 
ond problem is not entirely independent of the first. Unlike the schoolchild, the 
infant and preschooler have not been exposed to the standardized series of experi- 
ences represented by the school curriculum. In developing tests for the elemen- 
tary, high school, and college levels, test constructors have a large fund of 
common experiential material from which they can draw test items. Prior to 
school entrance, on the other hand, the child’s experiences are far less standard- 
ized, despite certain broad cultural uniformities in child-rearing practices. Under 
these conditions, both the construction of tests and the interpretation of test re- 
sults are much more difficult. To some extent, the same difficulty is encountered 
in testing older adults, whose schooling was completed many years earlier an 

who have since been engaged in highly diversified activities. In this and the next 
section, we shall examine some of the implications of these problems for early 


childhood and adult testing, respectively. 


342 ABILITY TESTING 


Predictive Validity of Infant and Preschool Tests. The conclusion that 
emerges from longitudinal studies is that preschool tests (especially when 
administered after the age of 2 years) have moderate validity in predicting 
subsequent intelligence test performance, but that infant tests have virtually 
none (Bayley, 1970; Lewis, 1973; McCall, Hogarty, & Hurlburt, 1972). 
Combining the results reported in eight studies, McCall and his associates (1972) 
computed the median correlations between tests administered during the first 30 
months of life and childhood IQ obtained between 3 and 18 years. Several trends 
are apparent in their findings. First, tests given during the first year of life have 
little or no long-term predictive value. Second, infant tests show some validity in 
predicting IQ at preschool ages (3—4 years), but the correlations exhibit a sharp 
drop beyond that point, after children reach school age. Third, after the age of 18 
months, predictive validities are moderate and stable, being mostly in the '40s 
and '50s. When predictions are made from these ages, the correlations seem to be 
of the same order of magnitude, whether the retest occurred at any time from age 
3 to age 18. 

The lack of long-term predictive validity of infant tests needs to be evaluated 
further with regard to other related findings. First, predictions may be improved 
by considering developmental trends through repeated testing. Second, several 
investigators have found that infant tests have much higher predictive validity 
within nonnormal, clinical populations than within normal populations. Signifi- 
cant validity coefficients in the .60s and .70s have been reported for children with 
initial IQs below 80, as well as for groups having known or suspected neurological 
abnormalities (Ireton, Thwing, & Gravem, 1970; Knobloch & Pasamanick, 1963, 
1966; Werner, Honzik, & Smith, 1968). Infant tests appear to be most useful as 
aids in the diagnosis of defective development resulting from organic pathology of 
either hereditary or environmental origin. 

In the absence of organic pathology, the child's subsequent development is de- 
termined largely by the environment in which he or she is reared. This the test 
cannot be expected to predict. In fact, parental education and other, more spe- 
cific characteristics of the home environment are better predictors of subsequent 
intellectual level than are infant test scores; and beyond 18 months, prediction is 
appreciably improved if test scores are combined with indices of familial socioeco- 
nomic status (Bayley, 1955; McCall et al., 1972; Pinneau, 1961; Werner et al., 
1968). It has also been proposed that individual differences in infancy may be rel- 
atively minor and transitory, since normal development is essentially species-gen- 
eral at this early stage (R. B. McCall, 1981). In subsequent years, individual 
differences widen, become increasingly more stable across age, and yield higher 
correlations with both genetic and environmental factors (Plomin, De Fries, & 
Fulker, 1988). Nevertheless, it should be noted that the 1990s have yielded in- 
creasing research activity on the predictive value of infant cognitive behavior, 
with promising results (Colombo, 1993). 


Nature of Early Childhood Intelligence. The validity of infant intelli- 
gence tests and the meaning of early performance measures can be understood 
more fully when viewed against research on the nature of early childhood 
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intelligence. The findings of such research fail to support the conception of a 
developmentally constant and unitary intellectual ability in infancy (Lewis, 1973, 
1976; McCall et al., 1972). Negligible correlations may be found over intervals 
even as short as three months; and correlations with performance on the same or 
different scales at the age of two years and beyond are usually insignificant. 
Moreovet, there is little correlation among different scales administered at the 
same age. These results have been obtained with both standardized instruments 
such as the Bayley Scales of Infant Development and with ordinal scales of the 
Piagetian type (Gottfried & Brody, 1975; King & Seegmiller, 1973; Lewis, 1976; 
Lewis & McGurk, 1972). 

Several investigators have concluded that, while lacking predictive validity, 
infant intelligence tests are valid indicators of the child’s cognitive abilities at the 
time (Bayley, 1970; Stott & Ball, 1965; Thomas, 1970). According to this view, a 
major reason for the negligible correlations between infant tests and subsequent 
performance is to be found in the changing nature and composition of intelli- 
gence with age. Intelligence in infancy is qualitatively different from intelligence 
at school age; it consists of a different combination of abilities. 

In a series of intensive studies, McCall and his co-workers explored the chang- 
ing nature of infant intelligence at six-month intervals during the first two years 
of life (R. B. McCall, 1976; McCall, Eichorn, & Hogarty, 1977; McCall et al., 
1972). Through statistical analyses of intercorrelations of different skills within 
each six-month level, as well as correlations across age between the same skills 
and between different skills, these researchers looked for precursors of later devel- 
opments in infant behavior. One conclusion that emerges from this research is 
that the predominant behavior at different ages exhibits qualitative changes that 
represent orderly and reasonable transitions. When the responses of infants to 
Gesell items were factor-analyzed separately at successive six-month age levels, 
the first-factor scores at each level correlated significantly across ages. Yet the be- 
havioral composition of these first factors varied from age to age. In other words, 
the specific manifestations of mental competence differed across age levels, al- 
though competence at one age predicted competence at later ages when each was 
assessed through age-appropriate behaviors. ; 

The concept of developmental transformations has been introduced to describe 
the changes in age-appropriate manifestations of intellectual competence: Fur- 
ther evidence of such qualitative changes in competence behavior is provided by 
the research of Yarrow and his associates on the infant’s mastery of her or his en- 
vironment (Messer et al., 1986; Morgan & Harmon, 1984; Yarrow et al., 1983; 
Yarrow et al., 1984; Yarrow & Messer, 1983). The results showed a temporal pro- 
gtession in both the tasks that elicited this behavior and the specific kind of be- 
havior elicited, such as visual regard, manipulation, and persistence in problem 


ronment, as in dropping a block to see it fall and hear it 
a bell to make it ring. Later, environmental mastery 


plex goal-directed activities, : 
obtain a toy. Through the identification of such specific age-appropriate behav- 


iors, both the construct validity and the predictive validity of ear 
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assessments of intelligence can be more effectively investigated. It is also impor- 
tant to consider the role of content knowledge in the exercise of intellectual 
processes and cognitive strategies (Reese, 1987), a fact that is being increasingly 
recognized in research on cognitive psychology as a whole. 


Implications for Intervention Programs. The demonstrated effectiveness 
of the various intervention programs of the Head Start era depends on the quality 
of the particular program (R. C. Collins, 1993; Haskins, 1989; Zigler & Muenchow, 
1992; Zigler & Styfco, 1993). Designed principally to enhance the academic 
readiness of children from disadvantaged backgrounds, these programs varied 
widely in procedures and results. Most were crash projects initiated. with 
inadequate planning for either implementation or evaluation. Only a few could 
demonstrate substantial improvements in children's performance—and such 
improvements were often limited and short-lived. In contrast to most programs, 
which sought only to "raise the IQ” through vaguely chosen procedures, a few 
high-quality projects clearly defined the specific intellectual skills to be improved 
and selected appropriate training procedures. In such cases, carefully conducted 
follow-ups did show significant and lasting improvements in the relevant skills. 
Some attention was also given to the broader context in which the program was 
introduced, which might include needed health care and social services in the 
home. Parental involvement proved especially valuable to supplement the 
preschool program at home and to ensure its continuation after the official 
project terminated (Jaynes & Wlodkowski, 1990). 

Follow-up programs to assess the nature and duration of the intervention also 
require close attention. Evaluating the effectiveness of such projects calls for con- 
siderable methodological sophistication (Collins & Horn, 1991; Willett & Sayer, 
1994). Apart from questions of experimental design, statistical artifacts associated 
with the psychometric properties of the assessment instruments may yield spuri- 
ous positive or negative results (Bejar, 1980). Differences in either the difficulty 
or the discriminative value of test items between treated and control groups, or 
between pretest and posttest performance of the same group, may lead to incor- 
rect conclusions regarding the success or failure of the program. The use of tests 
developed and scored on the basis of item response theory (IRT, chap. 7) and in- 
dividually tailored to the examinees (CAT, chap. 10) avoids some of these diffi- 
culties. In recent years, there has been a revival of interest in the establishment of 
properly designed and carefully executed intervention programs, modeled on 
some of the more effective earlier programs (R. C. Collins, 1993; Consortium, 
1983; Haskins, 1989; Whimbey, 1990; Zigler & Styfco, 1993). Moreover, the de- 
velopment of the newer programs can benefit from the rapidly growing database 
from research on child intelligence (see, e.g., Horowitz & O'Brien, 1989). A par- 
ticularly promising long-term project concentrates on the effect of parental be- 
havior toward 1- and 2-year-old children on the children’s subsequent intellectual 
performance (Hart & Risley, 1995). The initial data already provide strong sup- 
port for a close relation between nature and extent of parental contacts and the 
children’s intellectual development. 
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PROBLEMS IN THE TESTING OF ADULT INTELLIGENCE 


Age Decrement. A distinctive feature introduced by the Wechsler scales for 
measuring adult intelligence (chap. 8) was the use of a declining norm to 
compute deviation IQs. Raw scores on the WAIS (and WAIS-R) subtests are first 
transmuted into standard scores with a mean of 10 and an SD of 3. These scaled 
scores are expressed in terms of fixed reference groups consisting of 500 persons 
between the ages of 20 and 34 years included in the standardization samples. The 
sum of the scaled scores on the 11 subtests is used in finding the deviation IQ in 
the appropriate age table. If we examine the sums of the scaled scores directly, 
however, we can compare the performance of different age groups in terms of a 
single, continuous scale. Figure 12-1 shows the means of these total scaled scores 
for the age levels included in the national standardization sample. 
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Figure 12-1. Decline in WAIS-R Full Scale Scaled Scores Across Age Cohorts. 


(Data from Wechsler, 1981, p. 26.) 
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As can be seen in Figure 12—1, the scores reach a peak between the ages of 20 
and 34 and then decline steadily in all older age groups. The deviation IQ is 
found by referring an individual's total scaled score to the norm for her or his own 
age group. Thus, if the test taker shows the same decline in performance with age 
as does the normative sample, her or his IQ should remain constant. Wechsler's 
original rationale for this procedure was that it is "normal" for a person's tested 
ability to decline with age beyond the 30s. 

In the interpretation of the age decline illustrated in Figure 12-1, however, we 
must consider an essential feature of the samples employed in test standardiza- 
tion. Because every standardization sample is a normative sample, it should reflect 
existing population characteristics at each age level (Anastasi, 1956). It follows 
that, when the educational level of the general population has been rising over 
several decades, older groups at any one point in time will have received less edu- 
cation than have younger groups. This educational difference is clearly reflected 
in the standardization samples of both the WAIS (tested between 1953 and 1954) 
and the WAIS-R (tested between 1976 and 1980). In both samples, the maxi- 
mum years of schooling are found in the 20- to 34-year levels, and educational 
level drops consistently in the older groups. Although the later, WAIS-R sample 
received much more education as a group than did the WAIS sample, the age de- 
cline in years of schooling is just as pronounced within the later as it is within the 
earlier standardization sample. And the corresponding age decline in WAIS- 
R-scaled scores closely parallels that found with the earlier WAIS. 

Age differences in amount of education are inevitable if a test standardization 
sample is to be truly representative of the population of the country at the time 
the norms are established. Nevertheless, the educational differences complicate 
the interpretation of the observed score decrements. The older groups in the stan- 
dardization sample may have performed more poorly on the test not because they 
were growing old but because they had received less education than the younger 
groups. 

The results obtained with the standardization samples of the Wechsler scales 
are typical of the findings of traditional cross-sectional studies of adult intelli- 
gence. Cross-sectional comparisons, in which persons of different ages are 
examined at the same time, are likely to show an apparent age decrement because 
cultural changes are confounded with the effects of aging. Amount of formal edu- 
cation is only one of many variables in which age groups may differ. Other cul- 
tural changes have occurred in human societies during the past half century that 
make the experiential backgrounds of 20-year-olds and 70-year-olds quite dissimi- 
lar. Certainly changes in communication media, such as radio, television, and in- 
ternet, and in transportation facilities have greatly increased the range of 
information available to the developing individual. Improvements in nutrition 
and medical care would also indirectly influence behavior development. 

Longitudinal studies, based on retests of the same persons over periods of 5 to 
40 years, have generally revealed the opposite trend, the scores tending to im- 
prove with age. Several of these longitudinal investigations have been conducted 
with intellectually superior groups, such as college graduates or individuals ini- 
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tially chosen because of high IQs (Bayley & Oden, 1955; R. B. Burns, 1966; D. P. 
Campbell, 1965; Nisbet; 1957; Owens, 1953, 1966). For this reason, some writers 
have argued that the findings may be restricted to persons in the higher intellec- 
tual or educational levels and do not apply to the general population. However, 
similar results have been obtained in other longitudinal studies with intermediate 
samples (Charles & James, 1964; Eisdorfer, 1963; Tuddenham, Blumenkrantz, & 
Wilkin, 1968), as well as with mentally retarded adults outside of institutions 
(Baller, Charles, & Miller, 1967; Bell & Zubek, 1960; Charles, 1953). 

Neither cross-sectional nor longitudinal studies alone can provide a conclusive 
interpretation of observed age changes. On the one hand, age differences in edu- 
cational level can produce a spurious age decrement in test performance in cross- 
sectional studies. On the other hand, as persons grow older, they are themselves 
exposed to cultural changes that may improve their performance on intelligence 
tests. Several excellent analyses of the methodological. difficulties inherent in 
each procedure, together with the required experimental designs, have been pub- 
lished.! Basically, what is needed is the combination of several procedures, as in 
the cross-sequential design (K. W. Schaie, 1965, 1994; Shock et al., 1984). This de- 
sign combines data from traditional cross-sectional and longitudinal testing with 
time-lag comparisons. The latter requires the testing of same-age cohorts” at differ- 
ent time periods. For instance, 20-year-olds tested in 1940 are compared with 20- 
year-olds tested in 1970. 

A few studies provide data that permit at least a partial analysis of the con- 
tributing factors. Owens (1966) in his 40-year retest of Iowa State University stu- 
dents and D. P. Campbell (1965) in his 25-year retest of University of Minnesota 
students also tested present freshmen in the respective colleges. Thus, multiple 


the same age 25 to 40 years apart, and the performance of a single group tested be- 
fore and after the same time intervals. In both studies, the initial group improv 


other experiential factors, rather than age per se, that produce both the rises and 
declines in scores obtained with the more limited experimental designs. The in- 
creasing interest in research on learning by older adults is evident in the compre- 
hensive survey by Kausler (1994). The book certainly helps to dispel stereotypes 


The Seattle Longitudinal Study (SLS). A particularly well planned, long-term 
research program using the cross-sequential design is the Seattle Longitudinal Study 
(K. W. Schaie, 1994; Schaie & Hertzog, 1986). Beginning in 1956, the investigators 
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1See, for example, Baltes-(1968), Botwinik (1984, chaps. 20, 21), Buss (1973), Nesselroade and Reese (1973), 
Nesselroade and Von Eye (1985), K. W. Schaie (1973, 1988a), and Schaie and Hertzog (1986). 

2As used in this context, a cohort is a group of persons of the same age, that is, born in the same year or other 
specified period. 
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administered a battery of ability tests? to a stratified random sample of 500 
persons, drawn from a population of approximately 18,000: members of a prepaid 
medical plan. This population is described as fairly representative of the census 
figures for a large metropolitan area. The sample included 25 men and 25 women 
at each five-year interval from 21 to 70. In successive stages of the SLS, the tests 
were adr .:aistered over six cycles (from 1956 to 1991). Included in each cycle 
were avart: de survivors of the original sample, as well as new samples introduced 
in each period. 

The basic experimental design of the SLS included repeated longitudinal 
retests of the same persons, cross-sectional comparisons of different age cohorts 
tested at the same time (e.g., 30-year-olds compared with 50-year-olds tested in 
1977), and comparisons of specific age cohorts tested at different times (e.g., 30- 
year-olds in 1963 and 30-year-olds in 1984). Through appropriate statistical 
analyses of data from the various types of comparisons, it was possible to identify 
performance changes associated with age, those associated with cultural changes 
in the particular society, and those associated with individual practice and experi- 
ence. As illustrated in Figure 12-2, the results showed that, in most functions, age 
decline begins later and tends to be less steep than suggested by the traditional 
cross-sectional comparisons. 

In addition to a comprehensive and systematic approach to the long-disputed 
effect of age on performance, the SLS investigated several other related questions. 
For instance, it was found that age changes vary with the intellectual function as- 
sessed, such as verbal ability, numerical ability, and perceptual speed (see Figure 
12-2). Hence, an overall measure of intelligence (such as an IQ) would have con- 
fused and obscured both rises and declines in ability. Considerable research was 
also concerned with the causes of change, especially the performance declines 
found for the more elderly persons. Among the principal variables associated with 
declining performance were poor health status, specific diseases, inactivity, lack of 
continued practice in particular functions, and such personality conditions as low- 
ered motivation and decreased flexibility. As a result of such findings, intervention 
procedures were developed in an effort to halt or reverse the observed ability de- 
clines with age. Current SLS research is especially directed to the development of 
such remedial programs (see K. W. Schaie, 1994; Schaie & Hertzog, 1986). 


Individual Differences and Age. In addition to the basic findings that age 
decrements are smaller and occur later in life than was formerly supposed, current 
research generally reveals wide individual differences in ability at all age levels. 
Any generalization, whether pertaining to age decrement or cohort differences, 
must be qualified by a recognition of the wide individual variability found in all 
situations. Individual differences within any one age level are much greater than 


“Initially, the tests included the Primary Mental Abilities battery developed by Thurstone through factor analy- 
sis of generally recognized aspects of intelligence (see chap. 11). A later version, the Schaie-Thurstone Adult 
Mental Abilities Test, was developed for use in this project (K. W. Schaie, 1988b). Eventually scores were re- 
ported for constructs rather than for single tests, in order to provide more stable and generalizable results. 


Psychological Issues in Ability Testing 349 


60 > 
55:77] 
50:7] 
8 rsur 
8 457 | 
a ar 
S 8 
m doc —[— Inductive Reasoning er 
E —6-— Spatial Orientation » m 
-..o-- Perceptual Speed "e 
357] —fA- Numeric Ability ` ` 
-g- Verbal Ability. — " : 
Sex --X- Verbal Memory 


S en Age | 


Figure 12-2. Longitudinal Estimates of Mean Factor Scores for Ability 
Constructs. (From K. W. Schaie; 1994, p. 308.) Note: Data accumulated from 7-year 
within-person retests. 


(Copyright © 1994 by the American Psychological Association. Reproduced by permission.) 


the average difference between age levels. As a result, the distributions of scores 
obtained by persons of different ages overlap extensively. This simply means that 
large numbers of older persons can be found whose performance equals that of 
younger persons. Moreover, the best performers within the older groups excel the 
poorest performers within the younger groups. Nor is such overlapping limited to 
adjacent age levels; the ranges of performance still overlap when extreme groups 
are compared. Thus some 80-year-olds will do better than some 20-year-olds. 
What is even more relevant to the present topic, however, is that the changes 
that occur with aging vary with the individual. Thus, between the ages of 50 and 
60, for example, some persons may show a decrease, some no appreciable change, 
and some an inctéasé in test performance: The amount of change, whether it be a 
drop or a rise, will also vary widely among individuals. Moreover, intensive stud- 
ies of persons of advanced age; extending into the seventh, eighth, and ninth 


350 ABILITY TESTING 


decades of life, indicate that intellectual functioning is more closely related to the 
individual's health status than to chronological age (Birren et al., 1983; Palmore, 
1970; Schaie & Gribbin, 1975). Other contributing factors are a favorable envi- 
ronment, with varied opportunities for intellectual stimulation, and the mainte- 
nance of a flexible lifestyle (K. W. Schaie, 1994; Schaie & Hertzog, 1986). 


Nature of Adult Intelligence. Within the life span, traditional intelligence 
testing has been oriented chiefly toward the schoolchild and the college student. 
At these levels, the test constructor can draw on the large common pool of 
experiences that have been organized into academic curricula. Most intelligence 
tests measure how well the individual has acquired the intellectual skills taught in 
the schools; and they can in turn predict how well the student is prepared for the 
next level in the educational ladder. Tests for adults, including the Wechsler 
scales, as well as the PMA tests used in the Seattle Longitudinal Study, draw 
largely on this identifiable common fund of experience. As the individual grows 
older and her or his own formal educational experiences recede farther into the 
past, this fund of common experience may become increasingly less appropriate 
to assess her or his intellectual functioning. Adult occupations are more 
diversified than childhood schooling. The cumulative experiences of adulthood 
may thus stimulate a differential development of abilities in different persons. 

Because intelligence tests are closely linked to academic abilities, it is not sur- 
prising to find that even the earliest studies of adults have shown larger age incre- 
ments in score among those individuals who have continued their education 
longer (D. P. Campbell, 1965; Hárnqvist, 1968; Husén, 1951; Lorge, 1945; 
Owens, 1953). Similarly, persons whose occupations are more “academic” in con- 
tent, calling into play verbal and numerical abilities, are likely to maintain their 
performance level or show improvement in intelligence test scores over the years, 
while those engaged in occupations emphasizing mechanical activities or inter- 
personal relations may show a loss. Some suggestive data in support of this hy- 
pothesis are reported by M. Williams (1960), who compared the performance of 
100 persons, ranging in age from 65 to over 90, on a series of verbal and nonver- 
bal tests. Rather striking correspondences were found between the individual's 
occupation and her or his relative performance on the two types of tasks. Longitu- 
dinal investigations of adults have also found suggestive relationships between 
total IQ changes and certain biographical inventory items (Charles & James, 
1964; Owens, 1966). 

Each time and place fosters the development of skills appropriate to its charac- 
teristic demands. Within the life span, these demands differ for the infant, the 
schoolchild, the adult in different occupations, and the retired septuagenarian 
(Baltes, Reese, & Lipsitt, 1980). An early demonstration of the implications of 
this fact for intelligence testing was provided by Demming and Pressey (1957). 
These investigators began with a task analysis of typical adult functions, con- 
ducted through informal surveys of reading matter and of reported daily activities 
and problems. On this basis, they prepared preliminary forms of some 20 tests "in- 
digenous” to the older years. The tests emphasized practical information, judg- 
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ment, and social perception. Results with three of these tests, administered to- 
gether with standard verbal and nonverbal tests to samples of different ages, 
showed that the older persons excelled the younger on the new tests while the re: 
verse relationship held for the traditional tests. All these types of research suggest 
that whether intelligence test scores rise or decline with increasing age in adult- 
hood depends largely on what experiences the individual undergoes during those 
years and on the relationship between these experiences and the functions cov- 
ered by the tests. Ata broader level, all test results can be best understood within 
a contextual framework. Task analyses of behavior in a diversity of adult settings, 
including occupational and other areas of socially significant activity, should 
both improve test construction and clarify the meaning of test scores (Anas- 
tasi,1986b). It is noteworthy that the 1980s showed and the 1990s show a con- 
spicuous increase in research concerned with human adults, especially with those 
at the upper end of the life span.* Life-span developmental psychology is a grow- 
ing research area (Rutter & Rutter, 1993). | 


POPULATION CHANGES IN INTELLIGENCE TEST 
PERFORMANCE 


Rising Scores. What ‘happens to the intelligence test performance of a 
population over long time periods? This is a question we have already encoun- 
tered in several connections, In the preceding section, it was seen that, as the 
educational level of the adult population increased over the decades, mean intel- 
ligence test performance rose accordingly. As a result, the older members of the 
normative sample, who had on the average completed less 'education than had 
the younger members, scored lower on the test than did the younger. A similar 
phenomenon is found in the standardization sample of tests for children. In the 
successive restandardizations of the Stanford-Binet and the WISC, the later 
standardization samples performed substantially better than did the earlier sam- 
ples. The result was that any one child would receive a lower IQ if tested with the 
revised edition than he or she would on the earlier edition, simply because her or 
his performance was evaluated against higher norms. The higher educational 
level of the parents of children tested in the later sample was one of the 
conditions mentioned for this rise in tested intelligence. 

This type of comparison may be designated the longitudinal study of popula- 
tions. The usual application of the longitudinal method in psychological research 
involves the repeated testing of the same individuals over time. In the longitudi- 
nal study of populations, however, the population is sampled at different time pe- 
tiods. The comparison is between cohorts of persons born at different times but 
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"See, for example, Bengtson and Schaie (1989), Birren and Bengtson (1988), Birren and Schaie (1991), Craik 


and Salthouse (1992), Fiske and Chiriboga (1990), Haplip and Panek (1993), Kausler (1994), Nadien (1989), 
Sonderegger (1992), Schulz and Ewen (1993); and Willis and Schaie (1986). 


352 ABILITY TESTING 


tested at the same ages.” Several large-scale investigations conducted during the 
first five decades of the twentieth century revealed a rising intelligence in the 
population as measured by standardized intelligence tests (Anastasi, 1985d, pp. 
126-130). With increasing literacy, higher educational levels, and other cultural 
changes, it was evident that the mean tested intelligence of the general popula- 
tion of all ages showed a steady rise for several decades. 

Various procedures have been employed in these comparative studies. One 
procedure is to administer the identical test after a lapse of time, as was done in 
surveys of 11-year-old Scottish children in 1932 and 1947 (Scottish Council, 
1949). Another procedure is to give two tests to a representative sample of per- 
sons in order to establish the correspondence between the two sets of scores and 
thereby, “translate” performance from one test to the other. This was done in a 
comparison of the performance of soldiers in the U.S. Army in World Wars I and 
II, who had been examined with the Army Alpha and the Army General Classifi- 
cation Test, respectively (Tuddenham, 1948). A third and technically more 
sound approach is based on the establishment of an absolute, sample-free score 
scale through the use of anchor items, as was done with the College Board tests. 
The application of item response theory (chap. 7) represents a further refinement 
of this approach. 


Declining Scores. | Whether the intelligence test scores of a given population 
rise, decline, or remain stable over time depends on many conditions. The time 
period covered, with its concomitant cultural changes, is clearly a prime factor. 
The age of the persons examined also makes a difference. For instance, a rising 
educational level of the population will directly affect the test performance of 
adults; but it will only indirectly influence children’s performance, since the 
children in the samples compared have had the same amount of education when 
tested. Another important consideration, particularly when examining a selected 
subpopulation, is any change in the degree of selection at different time periods. For 
example, if a larger proportion of the population attended high school in 1960 
than in 1910, as was undoubtedly the case, then the 1910 high school students 
represent a more highly selected sample of the general population of their own 
time than do the 1960 high school students. Apparent inconsistencies between 
rising and declining scores may arise from the nature of the tests employed, the 
particular subpopulations tested Leg, college-bound high school seniors, general 
adult population, elementary school children), or the specific time periods 
covered (e.g., Flynn, 1984, 1987). 

The number and complexity of conditions that may account for a population 
rise or decline in tested intelligence are well illustrated by an analysis of the 
highly publicized score decline on the College Board's Scholastic Aptitude Test, 
or SAT (Donlon, 1984, pp. 188-191; Wirtz, 1977). Between 1963 and 1977, the 
mean SAT Verbal score fell from 478 to 429, and the mean SAT Mathematical 
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score fell from 502 to 470. In an effort to understand this steady 14-year score de- 
cline, a specially appointed panel commissioned 38 studies by experts in various 
areas and considered an impressive array of causal hypotheses. 

A major conclusion reached by the panel was that the causal pattern varied 
from the first to the second half of the 14-year period. During the first 7 years, the 
score decline resulted predominantly from a compositional change in the group 
taking the SAT. Because of a continuing increase in the proportion of high school 
graduates going to college over this period, the sampling became progressively less 
selected in the cognitive skills measured by the test. During the second 7 years, 
however, the college-going population had become stabilized, and sampling selec- 
tion accounted for a much smaller portion of the score decline. For this period, 
the explanation had to be sought principally in conditions in the home, the 
school, and society at large. The panel observed that the available data do not 
permit a determination of the relative contribution of different cultural changes 
to the score decline. Among the many factors cited as probably significant, how- 
ever, were a diminished emphasis on academic standards, grade inflation and 
automatic promotions, reduced homework assignment, increased school ab- 
senteeism, diminished attention to mastery of skills and knowledge, excessive TV 
watching, and the social upheavals of the period which competed for attention in 
the lives of students. 

A. subsequent. analysis (Turnbull, 1985) suggested that the decrease in SAT 
scores during the second 7-year period may have been in part a delayed and indi- 
rect consequence of the compositional change of the first 7 years. Because a larger 
proportion of poorly prepared students remained in high school (and applied for 
college admission), many of the changes in schooling conditions may be seen as 
adaptive responses of the schools to the greater diversity of their students. These 
responses, which tended to lower the demand level of the school program, are il- 
lustrated by grade inflation, proliferation of elective courses in vocational and 
general areas, simplification. of textbooks, and reduction of homework assign- 
ments. Thus the changes in student populations led to educational program 
changes, which in turn led to further score decline. This hypothesis is consistent 
with the finding that high scores were fewer in the 1970s. The scores of students 
ranking high in their class continued to drop, while those of the lowest-ranking 
students stabilized or improved. Moreover, in the late 1970s and the 1980s, there 
was evidence of a turning point in the academic demand level of the schools, 
which was reflected in an upturn in SAT scores. 


Overview. The methodological problems encountered in efforts to assess 
population changes are highlighted by an attempted survey of reported gains in 
intelligence test performance in 14 nations (Flynn, 1987). The findings revealed 
so many inconsistencies and incompatibilities, and their explanations were so 
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elusive, as to lead the author to draw no conclusion save the negative proposal 
that intelligence tests do not actually measure intelligence! Efforts to measure 
population changes are still under investigation; and the optimal procedure is yet 
to be fully developed. An excellent treatment of relevant procedural questions 
can be found in the report of a conference on the subject, edited by Collins and 
Horn (1991). 

For the present, proper understanding of research findings on rises and declines 
of the test performance of populations requires several kinds of related informa- 
tion. First, the particular tests administered must be identified, with special atten- 
tion to their coverage of specific processes and content (e.g., perceptual speed, 
memory, verbal comprehension, spatial orientation), as well as the source of their 
norms. Second, the dates when testing and retesting occurred must be reported. 
Third, relevant data must be included about the populations that were sampled, 
as well as any sampling changes in retesting, such as loss of initially better or 
poorer performers in successive samples. Fourth, information should be available 
about any significant cultural changes affecting the population under study; these 
could refer to amount and nature of education, developments in communication 
or transportation that might affect intercultural contacts, or any other events in 
the experiential history of individuals that could alter their intellectual progress 
in either amount or direction. 

The possible confusions that may disrupt surveys of changes in different popu- 
lations also include the popular misuse of the tetm "intelligence" (and especially 
"IQ") as though it represented a unitary, identifiable property of the organism 
(see, e.g., Flynn, 1987). If, instead, intelligence is construed as that combination 
of abilities required for effective functioning and advancement in a particular 
context (see chap. 11), then the proper interpretation of either an individual's in- 
telligence test performance or the mean performance of populations at different 
times and places certainly requires knowledge about the major conditions cited 
above. There is increasing recognition of the many technical problems encoun- 
tered in the measurement of change and the diversity of procedures for assessing 


hes in different contexts and for different purposes (see especially, Gottman, 


CULTURAL DIVERSITY 


The use of tests with persons of diverse cultural backgrounds is examined from 
various angles in different parts of this book: Chapter 18 is concerned with social 
and ethical considerations in such testing, particularly with reference to minority 
groups within a broader national culture. Technical problems pertaining to test 
bias and item bias were analyzed in chapters 6 and 7. And in chapter 9 we consid- 
ered typical tests originally designed for various transcultural applications. In this 
section, we shall present some basic theoretical issues about the role of culture in 


behavior, with special reference to the use and interpretation of intelligence test 
scores. 
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The Field of Cultural Psychology. The last three decades of the twentieth 
century witnessed a conspicuous growth in research and writing on cultural psy- 
chology (Berman, 1990; Irvine & Berry, 1988). There were even international 
conferences devoted primarily to the topic (see, e.g., Brislin, 1993; Cronbach & 
Drenth, 1972; Manoleas, 1995). This field concerns essentially behavioral differ- 
ences among groups reared and functioning in identifiably diverse cultural con- 
texts. Such contexts could be as narrow as a neighborhood or village, or as wide as 
a nation or continent. Moreover, some broadly identified cultures, such as 
Hispanics in the United States, have subcultures, such as Mexican, Cuban, 
Puerto Rican, and Central and South American, which are sufficiently diverse to 
need to be identified as such in understanding individual behavior (see, e.g., 
Geisinger, 1992; Marín & Marín, 1991). 

The role of culture in human behavior can be conceived as a form of domain 
specificity, similar to that recognized in cognitive psychology, as discussed in chap- 
ter 11. Beginning with the analysis of basic psychological processes, such as learn- 
ing, remembering, problem solving, and emotion, cognitive psychologists soon 
discovered that such processes are manifested in behavior that is domain-specific. 
For example, memory—or problem solving, or reasoning—may vary widely when 
one is playing chess, working on mathematical problems, or writing an essay. 

Cultural psychology began with the study of behavior in widely diverse and 
previously unfamiliar cultures that were conspicuously different from the investi- 
gator's own culture. It rapidly evolved into a systematic exploration of the differ- 
ent experiential histories of persons reared in different cultures. Essentially, the 
current field of cultural psychology represents a recognition of the cultural speci- 
ficity of all human behavior, whereby basic psychological processes may result in 
highly diverse performance, attitudes, self-concepts, and world views in members 
of different cultural populations (L. L. Adler & Gielen, 1994; Berry, Poortinga, 
Segall, & Dosen, 1992; Diaz-Guertero, 1990; Shweder & Sullivan, 1993). The 
contribution of culture is being increasingly recognized and integrated in all fields 
of psychology, from research and theory on life-span development, social behav- 
ior, emotion, or thinking,’ on the one hand, to the practice of industrial-organiza- 
tional, clinical, or counseling psychology, on the other. 

The growing awareness of the role of cultural factors in all fields of psychology 
was demonstrated in the program of the 1994 annual convention of the American 
Psychological Association (APA, 1994). In addition to a full-day Continuing Edu- 
cation Workshop on “Cultural Sensitivity in Assessment and Intervention, the 
annual Master Lectures were devoted to the general topic of "International Per- 
spéctives on Cross-Cultural Psychology." In this series, six internationally recog- 
nized psychologists discussed different aspects of the topic. These lectures were 
themselves a part of the Topical Miniconvention on "Cultural Diversity: The 


7 : i d Marcus (1994), Mistry and Rogoff (1985), Nu: 
See, for example; Gormly and Brodzinsky (1993), Kitayama AOp (el, Smith and Bond (1993), and 


gent, Lester, and Brazelton (1991), Rogoft (1990); Rogoff and Chavajay 
Topping, Crowell, and Kobayashi (1989). 

*See, for example, Freilich, Raybeck, and Savishinsl 
Triandis, Dunnette, and Hough (1994). 


ky (1991), Pedersen (1987), Pedersen and Ivey (1993), and 
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Future of America," which was distributed over the full five-day APA convention. 
Further indication of the rapidly growing cross-cultural orientation of psychology 
is provided by the establishment of a new journal, Culture and Psychology (1995). 


Cultural Differences Versus Cultural Handicap. When psychologists 
began to develop instruments for cross-cultural testing in the first quarter of the 
twentieth century, they hoped it would be at least theoretically possible to 
measure "hereditary intellectual potential" independently of the impact of cultur- 
al experiences. The individual's behavior was thought to be overlaid with a sort of 
cultural veneer whose penetration became the objective of what were then called 
“culture-free” tests. Subsequent developments in genetics and psychology have 
demonstrated the fallacy of this concept. We now recognize that hereditary and 
environmental factors operate jointly at all stages in the organism’s development 
and that their effects are inextricably intertwined in the resulting behavior. For 
humans, culture permeates nearly all environmental contacts. Since all behavior 
is thus affected by the cultural milieu in which the individual is reared and since 
psychological tests are but samples of behavior, cultural influences will and should 
be reflected in test performance. It is therefore futile to try to devise a test that is 
free from cultural influences. Later, the objective was to construct tests that pre- 
suppose only experiences that are common to different cultures. For this reason, 
such terms as "culture-common," “culture-fair,” and “cross-cultural” replaced the 
earlier “culture-free” label. 

Nevertheless, no single test can be universally applicable or equally "fair" to all 
cultures. There are as many varieties of culture-fair tests as there are parameters in 


not interchangeable but are useful in different types of cross-cultural comparisons. 
It is unlikely, moreover, that any test can be equally “fair” to more than one cul- 
tural group, especially if the cultures are quite dissimilar, While reducing cultural 
differentials in test performance, cross-cultural tests cannot completely eliminate 
such differentials. Every test tends to favor persons from the culture in which it 
was developed. The mere use of Paper and pencil or the presentation of abstract 
tasks having no immediate Practical significance will favor some cultural groups 
and handicap others. Emotional and motivational factors likewise influence test 
performance. Among the many relevant conditions differing from culture to.cul- 
ture may be mentioned the intrinsic interest of the test content, rapport with the 
examiner, drive to do well ona test, desire to excel others, and past habits of solv- 
ing problems individually or cooperatively. 
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A related concept is that of cultural deprivation. Although this term has been 
used in many different senses, Feuerstein (1980, 1991; Feuerstein & Feuerstein, 
1991) gave the concept a special meaning and made it a focal point in his cogni- 
tive training program. He regarded cultural deprivation as a state of reduced cog- 
nitive modifiability, produced by a lack of mediated learning experience. The 
transmission of the accumulated knowledge of the culture from one generation to 
the next is a distinctly human occurrence. In this process, the parent or other 
caregiver acts as a mediating agent in selecting and organizing the stimuli encoun- 
tered by the child. Feuerstein considers such mediated learning essential for the 
child’s cognitive development, insofar as it fosters the establishment of learning 
sets, orientations, and other behavior patterns that facilitate subsequent learning. 
Children who, for whatever reason, have failed to experience such mediated 
learning lack the prerequisites for high-level cognitive functioning. By contrast, 
those who have had mediated learning experiences within their own culture have 
developed the prerequisite skills and habits for continued modifiability, and they 
can adapt to the demands of a new culture after a relatively brief transition pe- 
tiod. It is also likely that, in developing nations, psychologists who are members 
of the indigenous culture will eventually develop and use tests appropriate for 
that culture. 

From a different point of view, the mere existence of cultural stereotypes may di- 
rectly affect an individual’s test performance (Steele, Spencer, & Aronson, 1995). 
A long-term research program found evidence that knowledge about existing 
stereotypes may affect some test-takers in their motivation and attitudes toward 
the test through distraction, self-concept, reduced effort, and low expectation of 
successful performance. This reaction has been called stereotype vulnerability and 
has been found to influence test scores in both gender and ethnic comparisons. To 
be effective, affirmative action programs need more than special remedial train- 
ing. In at least’ some individual cases, the performance expectancy in specific 
areas, such as verbal, mathematical, or spatial, may require special attention, to 
prevent failure by persons who would otherwise have performed at a normal or su- 
perior level. 


Language in Transcultural Testing. Most traditional cross-cultural tests 
utilized nonverbal content in the hope of obtaining a more nearly culture-fair 
measure of the same intellectual functions measured by verbal intelligence tests 
(see chap. 9). Both assumptions underlying this approach are questionable. First, 
it cannot be assumed that nonverbal tests measure the same functions as verbal 
tests, however similar they may appear. A spatial analogies test is not merely a 
nonverbal version of a verbal analogies test. Some of the early nonlanguage tests, 
such as the Army Beta, were heavily loaded with spatial visualization and 
perceptual abilities, which are quite unrelated to verbal and numerical abilities. 
Even in tests like Raven’s Progressive Matrices and other nonlanguage tests 


deliberately designed to tap reasoning and abstract conceptualization, factorial 


analyses have revealed a large contribution of nonverbal factors to the variance of 
test scores (e.g., R. S. Das, 1963). Such findings are corroborated’ by more recent 
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research in cognitive psychology, which has repeatedly demonstrated the content 
specificity of thinking processes. Problem-solving strategies and skills are deve]. 
oped while responding to particular content domains and in particular contexts 
(see chap. 11). 


performance and other nonverbal tests than in verbal tests (Irvine, 19694, 1969b; 
Jensen, 1968; Ortar, 1963, 1972; Trimble, Lonner, & Boucher, 1983; Vernon, 


the original; it simply presents certain cues which, as a result of Past experience, 
lead to the Perception of the object. If the cues are highly reduced, as in a picture 
of a head to represent a whole person, and if the necessary past experience is ab- 
sent, the correct Perception may not follow. There is now a considerable body of 
empirical data indicating marked differences in the perception of pictures by per- 
Do Si different cultures (R. J. Miller, 1973; Segall, Campbell, & Herskovits, 
1966). 

From still another angle, nonverbal, spatial-perceptual tests frequently require 
relatively abstract thinking processes and analytic cognitive styles characteristic 
of middle-class Western cultures (Berry, 1972; R. A. Cohen, 1969). Persons 
reared in other cultural contexts may be much less accustomed to such problem- 
Solving approaches. Cultures differ in the value they place on generalization and 


e Testing Standards (AERA, APA, NCME, 1985, chap. 13) clearly specified 
that the reliability, validity, and norms of the translated version should be inde- 
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sample. In an English-speaking culture, for example, inadequate mastery of Eng- 
lish may handicap a child not only on an intelligence test but also in schoolwork, 
contacts with fellow students, and play activities; and they might thus interfere 
with subsequent intellectual and emotional development. In an adult, it could se- 
riously restrict job performance, interpersonal relations, and other significant life 
activities. Many other similar examples of such cultural differentials can be cited. 
Some are cognitive differentials, such as reading disabilities or ineffective strate- 
gies for solving abstract problems; others are attitudinal or motivational, such as 
lack of interest in intellectual activities, hostility toward authority figures, low 
achievement drive, or poor self-concept. All such conditions can be ameliorated 
by several means, ranging from language instruction and functional literacy train- 
ing to personal counseling and psychotherapy. All are likely to affect both test 
performance and daily life activities of child and adult. 

The importance of language in both test performance and everyday life activi- 
ties has led to the development of language proficiency tests in the original and 
the adopted language. Most of the currently available tests in the United States 
deal with English and Spanish: A list of sample tests, as well as references to the 
research, development, and evaluation of such tests, can be found in Duran 
(1989, pp. 574-577). For a broad treatment of bilingualism, see de Groot and 
Barry (1993). 


The Testing Situation. The rapidly expanding transcultural. contacts in 
today's world increase the probability of tests being administered to persons from 
different cultures. Every examiner can anticipate testing one or more persons 
from a culture other than her or his own. Hence the training of test examiners 
should include some knowledge of one or more dissimilar cultures, with special 
attention to the likely cultural effects on the behavioral development of individu- 
als, Even more important are the probable effects of such differences on the 
examinee’s response to testing. Examples of some broad sources of differential 
testing behavior are variations in the person’s self-concept, world view, degree of 
self-disclosure, and habits of solving problems singly or in groups. 

As was demonstrated in earlier parts of this chapter, to remove from a test 
those parts judged to be especially difficult because of the individual’s cultural 
background would only reduce its predictive validity and fail to alert the test 
taker to areas that need strengthening for effective performance in the antici- 
pated context. Accordingly, the traditional approach is being replaced by a solu- 
tion that shifts the focus on the examiner's behavior in the testing situation. 

The 1980s and 1990s have produced an array of handbooks, guidelines, journal 
articles, and other publications on the training and appropriate behavior of exam- 
iners in testing members of other cultures (Atkinson, Morten, & Sue, 1993; 
Myers, Wohlford, Guzman, & Echemendia, 1991; Stricker et al., 1990). Some 
deal specifically with testing students (Samuda, Kong, Cummins, Lewis, & Pas- 
cual-Leone, 1991); some focus on children (Miller-Jones, 1989; Rogoff & Morelli, 
1989); others focus on a particular cultural population or one of its subdivisions 
(Dana, 1984; Diaz-Guerrero & Szalay, 1991); but the majority deal with the gen- 
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eral problem of cross-cultural testing. These range from the very general and suc- 
cinctly stated guidelines published by the American Psychological Association as 
a short pamphlet in 1991 and reprinted in the 1993 American Psychologist 
(“Guidelines,” 1993), to a very thorough and detailed treatment in a book by 
Dana (1993). Both of these general sources should be carefully examined by any- 
one planning to administer tests, It is also relevant to note that a comprehensive 
discussion of the varieties, problems, and advantages of bilingualism can be found 


Essentially, the special role of the examiner in cross-cultural testing includes, 
first, obtaining full information, in a preliminary session with the examinee, re- 


taker. In this connection, the examiner must consider how to introduce the test, 
how to explain its Purpose, and how to motivate the examinee to perform appro- 
priately. Moreover, the examiner's own manner should establish favorable inter- 
personal relations with the test taker, Test-score interpretation and use should 
obviously consider cultural factors, as should the nature of the feedback and to 
Ri 
obtaining the necessary information from examinees. Some of these question- 


fent Personality 
Inventories 


Ithough the term “personality” is sometimes employed in a broader sense, 
in conventional psychometric terminology “personality tests’ 


chapters, we examine the major varieties of Personality tests. For the present pur- 
pose, the available instruments are classified according to method of obtaining 
data from individuals. This chapter deals with self-report personality inventories. 


'For more background on Current approaches to personali thi h. Aiken (1993), Burger 
(1993), and Maddi (1989). "H theory and research, see Aiken 
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through other techniques. For more detailed treatment of the topic, the reader is 
referred to these specialized books (e.g., Angleitner & Wiggins, 1986; Butcher, 
1995; Lanyon & Goodstein, 1997; and J. S. Wiggins, 1973/1988). 

In the development of personality inventories, several approaches have been 
followed in formulating, assembling, selecting, and grouping items. Among the 
major procedures in current use are those based on content relevance, empirical 
criterion keying, factor analysis, and personality theory. Each of these approaches 
will be discussed and illustrated in the following sections. It should be noted, 
however, that they are not alternative or mutually exclusive techniques. Theoret- 
ically, all can be combined in the development of a single personality inventory. 
In actual practice, most inventories now use two or more of these procedures. 

Although some personality tests are used as group screening instruments, the 
majority find their principal application in clinical and counseling settings. For 
this reason, the next four chapters need to be considered in conjunction with the 
perspectives of those specialized contexts presented in chapter 17. In their pre- 
sent state of development, most personality tests should be regarded either as aids 
in individual assessment or as research instruments. 


CONTENT-RELATED PROCEDURES 


The prototype of self-report personality inventories was the Woodworth Personal 
Data Sheet, developed for use during World War I (see references in chap. 2). 
This inventory was essentially an attempt to standardize a psychiatric interview 
and to adapt the procedure for mass testing. Accordingly, Woodworth gathered 
information regarding common neurotic and preneurotic symptoms from the psy- 
chiatric literature as well as through conferences with psychiatrists. It was in ref- 
erence to these symptoms that the inventory questions were originally 
formulated. The questions dealt with such deviant behaviors as abnormal fears or 
phobias, obsessions and compulsions, nightmares and other sleep disturbances, 
excessive fatigue and other psychosomatic symptoms, feelings of unreality, and 
motor disturbances such as tics and tremors. In the final selection of items, Wood- 
worth applied certain statistical checks, to be discussed in the next section. Nev- 
ertheless, it is apparent that the primary emphasis in the construction and use of 
this inventory was placed on the content relevance of its items, as indicated in 
the sources from which they were drawn as well as in the common recognition of 
certain kinds of behavior as maladaptive. The legacy of Woodworth to modern 
personality inventory construction is apparent in the fact that when pools of new 
items are developed, the instructions to item writers are still generally based on a 
content analysis of the behavior area to be assessed. 

A modern example of the content-related approach to self-report inventory 
development. is the, Symptom Checklist-90-Revised (SCL-90-R—Derogatis, 
1994). The SCL-90-R was designed to screen for psychological problems and 
symptoms of psychopathology. It consists of 90 symptoms, described briefly Leg, 
Poor appetite, Faintness or dizziness). Respondents are asked to indicate, using a 
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5-point scale, how much they have been distressed by each of the problems dur- 
ing the past seven days. The SCL-90-R provides separate norms for male and fe- 
male adult and adolescent nonpatients, as well as for psychiatric inpatients and 
outpatients. However, some of the norms are not sufficiently representative; for 
example, the psychiatric inpatient normative group was predominantly of lower 
socioeconomic status, and the adolescent group was mostly middle class and al- 
most totally White. 

The SCL-90-R items resemble those of earlier checklist-type instruments not 
only in that they were selected on the basis of content relevance and clinical util- 
ity, but also in that some of them can be traced back to the Woodworth Personal 
Data Sheet through intermediate scales like the Hopkins Symptom Checklist and 
the Cornell Medical Index ( Derogatis & Lazarus, 1994). The items are organized 
into nine dimensions of psychopathology, namely, Somatization, Depression, 
Anxiety, Hostility, Psychoticism, Interpersonal Sensitivity, Phobic Anxiety, Para- 
noid Ideation, and Obsessive-Compulsive symptoms. Factor analytic studies of 
these scales suggest that they are intercorrelated and thus not very useful in differ- 
ential diagnosis; nevertheless, the global indices derived from the Checklist have 
proved to be reliable indicators of the presence and severity of psychopathology 
(Payne, 1985). The SCL-90-R and related instruments, such as the Brief Symp- 
tom Inventory, can be used most appropriately as part of a battery, in the evalua- 
tion of change through a course of therapy, and in research on the outcome of 
various treatments. 
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abnormality. The method of contrasted groups was likewise employed in the 
selection of items. Only symptoms reported at least twice as often in a previously 
diagnosed psychoneurotic group than in a normal group were retained. 

Despite some use of such empirical chec's, however, content-related ap- 
proaches rely essentially on a literal or veridical interpretation of questionnaire 
items. The response to each question is regarded as an index of the actual pres- 
ence or absence of the specific problem, belief, or behavior described by the ques- 
tion. In empirical criterion keying, on the other hand, the responses are treated as 
diagnostic or symptomatic of the criterion behavior with which they were found 
to be associated. In an early description of this approach, Meehl (1945) wrote: 


... the verbal type of personality inventory is not most fruitfully seen as a “self- 
rating” or self-description whose value requires the assumption of accuracy on the 
part of the testee in his observations of self. Bag is the response to a test item 
taken as an intrinsically interesting segment of verbal behavior, knowledge regard- 
ing which may be of more value than any knowledge of the "factual" material about 
which the item superficially purports to inquire. Thus if a hypochondriac says that 
he had “many headaches” the fact of interest is that he says this. (p. 9) 


A self-report inventory is indubitably a series of standardized verbal stimuli. 
When criterion-keying procedures have been followed, the responses elicited by 
these stimuli are scored in terms of their empirically established behavior corre- 
lates. They are thus treated like any other psychological test responses. That ques- 
tionnaire responses may correspond to the person’s perception of reality does not 
alter this situation. It merely provides one hypothesis to account for the empiri- 
cally established validity of certain items. 


The Minnesota Multiphasic Personality Inventories 


The outstanding example of empirical criterion keying in personality test con- 
struction is the Minnesota Multiphasic Personality Inventory (MMPI). In recent 
years, the MMPI has been revised and reconstituted into two separate versions, 
the MMPI-2 (Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989) and 
the MMPI-Adolescent (MMPI-A—Butcher et al., 1992). In spite of the exis- 
tence of these newer versions, no discussion of either can proceed without refer- 
ence to the original MMPI and the role it played in the history of personality 
assessment. Although a thorough description of it is beyond the scope of this text, 
it must be noted that for a period of almost half a century, the MMPI was the most 
widely used and the most thoroughly researched personality test. 


see earlier editions of this text (e.g., Anastasi, 1988b). 


5 1 I iprión of the original MMPI, 
For a fairly concise description of the origins in the classic volumes by Dahlstrom, Welsh, and 


More extensive treatments of this instrument can be found 
Dahlstrom (1972, 1975). 
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In many ways the MMPI, as an instrument, was a victim of its own success. 
Conceived in the 1930s by Starke R. Hathaway, a clinical psychologist, and 


was firmly entrenched as the leading personality test and used as frequently, or 
more, with normal subjects in counseling, employment, medical, military, and 
forensic settings, as with psychiatric patients. By the 1980s, the MMPI literature 


wise objectionable items, as well as the addition of new ones; the creation of sev- 
eral new validity, supplementary, and content scales; and the separation of the 
inventory into two versions, suitable for different age groups. 


The Minnesota Multiphasic Personality Inventory-2. The MMPI-2 
items consist of 567 affirmative Statements to which the test taker gives the 
Fesponses "True" or “False.” The first 370 items, which are Virtually identical to 
those in the MMPI except for editorial changes and teordering, provide 
all the responses needed to score the original 10 “clinical” and three “validity” 
scales. The Temaining 197 items (107 of which are new) are needed to score 


?The original articles have been reproduced in Dahlstrom and Dahlstrom (1980). 
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the full complement of 104 new, revised, and retained validity, content, and 
supplementary scales and subscales that make up the complete inventory, The 
items range widely in content, covering such areas as general health; affective, 
neurological, and motor symptoms; sexual, political, and social attitudes; 
educational, occupational, family, and marital questions; and many well-known 
neurotic or psychotic behavior manifestations, such as obsessive and compulsive 
states, delusions, hallucinations, ideas of reference, phobias, and sadistic and 
masochistic trends. Dahlstrom (1993a) has prepared a manual supplement that 
provides all the information necessary to compare the items of the MMPI-2 with 
the original ones. A few illustrative items, followed by the numbers they bear in 
the current form of the test, are shown below:* 


My sleep is fitful and disturbed. (39) 

I believe | am being plotted against. (138) 

I am worried about sex. (166) 

When I get bored I like to stir up some excitement. (169) 

Most people inwardly dislike putting themselves out to help other people. 
(286) 


The MMPI-2 provides scores on 10 basic “clinical scales,” which are the same 
as those in the original MMPI and are listed below: 


1. Hs: Hypochondriasis 6. Pa: Paranoia 
2. D: Depression 7. Pt: Psychasthenia 
3. Hy: Hysteria 8. Sc: Schizophrenia 


4. Pd: Psychopathic deviate 9. Ma: Mania 
5. Mf: Masculinity-femininity ` 0. Si: Social introversion 


Eight of these scales were developed empirically, in the 1940s, by criterion key- 
ing of items that differentiated between small clinical samples, mostly of about 50 
persons each, representing the traditional psychiatric diagnosis in use at the time, 
and a normal control group of 724 visitors and relatives of patients in the Univer- 
sity of Minnesota hospitals (Hathaway & McKinley, 1940, 1943). The Masculinity- 
femininity scale, originally intended to distinguish between homosexual and 
heterosexual men, was developed from the differences between male soldiers and 
female airline employees in item-endorsement frequency. Scores on this scale in- 
dicate the extent to which a person’s interests and attitudes match the stereotype 
of her or his sex group. The Social introversion scale, added later, was derived 
from the responses of two contrasted groups of college students selected on the 
basis of extreme scores on a test of introversion-extraversion. 


*From Minnesota Multiphasic Personality Inventory-2. Copyright © by The Regents of the University of Min- 
nesota. All rights reserved. Reproduced by permission. 
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'— In retaining the basic scales of the MMPI intact (except for nine deleted items, 
a number of editorial changes, and a reordering of the items), the MMPI-2 devel- 
-opers sought to preserve the wealth of clinically useful information associated 
with the interpretation of profile codes based on patterns of scores on those scales 
: (Graham, 1993; Greene, 1991). However, also retained were the obsolete notions 
-of psychopathology implicit in those scales and the consequences of the naive 
'and flawed application of the empirical method of contrasted criterion group key- 
‘ing. Other problems, such as the multidimensionality and overlap of the basic 
«scales, were kept as well (Helmes & Reddon, 1993). 

^ “An outstanding feature of the original MMPI was its use of three so-called va- 
lidity scales, which have also been retained in the MMPI-2.° These scales are not 
concerned with validity in the technical sense. In effect, they represent checks on 
carelessness, misunderstanding, malingering, and the operation of special response 
sets and test-taking attitudes. The validity scores include: 


Lie Score (L): based on a group of items that make the respondent appear in a 
favorable light but are unlikely to be truthfully answered in the favorable di- 
rection (e.g., I do not like everyone I know). 

Infrequency Score (F): determined from a set of 60 (out of the original 64) items 
answered in the scored direction by no more than 1096 of the MM PI stan- 
dardization group. Although representing undesirable behavior, these items 
do not fit any particular pattern of abnormality. Hence, it is unlikely that 
any one person will actually show all or most of. these symptoms. A high F 
score may indicate scoring errors, carelessness in responding, gross eccen- 
tricity, psychotic processes, or deliberate malingering. 

Correction Score (K): using still another combination of specially chosen items, 
this score provides a measure of test-taking attitude believed to be more sub- 
tle. A high K score may indicate defensiveness or an attempt to “fake good." 
A low K score may represent excessive frankness and self-criticism or a de- 

y liberate attempt to “fake bad.” 


mi 


The first two scores (L, F) are ordinarily used for an overall evaluation of the 
test record. If either of these scores exceeds a specified value, the record is consid- 
ered invalid. The K score, on the other hand, was designed to function as a sup- 
‘pressor variable. It is employed to compute a correction factor that is added to the 
scores on some clinical scales to obtain adjusted totals. The effectiveness of this 
use of the K score is questionable; therefore, scores on the affected scales can be 
reported with and without this correction. Although an unusually high K score 
would in itself make a record suspect and call for further scrutiny, moderate eleva- 
tions of the K scale may in fact reflect ego strength and a positive adjustment. It is 


particularly important to evaluate elevations in light of the individual’s history 
and life circumstances. 


5There is also a Cannot Say (?) score representing the number of items that were double-marked or omitted. If 
this count exceeds 30 items, the test record is considered highly Suspect and probably invalid 
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Among the 21 supplementary scales of the MMPI-2 are three new “validity” 
indicators that can help to assess the care and veracity with which the-test takers 
respond to the inventory. They are the Back F (Fy) scale, the Variable Response 
Inconsistency Scale (VRIN), and the True Response Inconsistency Scale 
(TRIN). Whereas the F, scale is basically an extension of the original F scale for 
items that appear in the second half of the inventory, VRIN and TRIN are new 
scales that consist of pairs of items with similar Or Opposite meanings and are 
aimed at detecting inconsistent or contradictory responses. 

The basic profile form for the MMPL2 (Fig. 13-1) includes the 13 validity and 
clinical scales carried over from the original version. There are also separate pro- 
file forms for 15 content scales, 27 content component scales, 21 supplementary 
scales, and 28 Harris-Lingoes subscales.$ Some of these scales and subscales are 
new, and some have been retained from the original; all of them, however, are 
scored by using the MMPI-2 normative sample of 2,600 adults, aged 16 to 84. 
This sample is far more representative of the current population of the United 
States than the original Minnesota normative group, having been collected in 
seven different states in an attempt to reflect the U.S: population in terms of sig- 
nificant demographic variables, including gender, age, and ethnicity (Dahlstrom 
& Tellegen, 1993). Nevertheless, the representativeness of the sample has been 
questioned, primarily because of its high levels of occupational and educational 
attainment and underrepresentation of Hispanics and Asian Americans, com- 
pared with the 1980 census figures (Duckworth, 1991). 

The main result of the renorming of the MMPI was a lowering in the score ele- 
vation of clinical profiles. This change, which had been widely anticipated, is 
probably due to generational differences, as well as to special factors unique to the 
original Minnesota sample and to the way in which it was used in the development 
of the MMPI (D. S. Nichols, 1992). At any rate, the cutoff T score necessary for 
considering a scale elevation to be of clinical interest is now 65, or approximately 
1.5 SDs above the mean, as opposed to 70. Another innovation introduced in the 
MMPI-2 is the use of uniform—as opposed to linearly derived or normalized —T 
scores in 8 of the 10 clinical and all of the content scales. This involved equating 
the scores on all those scales ro ap average composite distribution; the uniform T 
scores allow comparisons across scales in terms of percentile equivalents, without 
significantly distorting the shape of the raw score distributions which are all posi- 
tively skewed, albeit to different degrees (Tellegen & Ben-Porath, 1992). 

The changes just described were certainly justified on psychometric grounds; 
however, because these changes result in differences between the profile patterns 
and codes obtained from the MMPI and the MMPI-2, a great debate has arisen 
concerning the viability of generalizing the findings from the vast MMPI profile 
interpretation literature, and from clinical lore, to the MMPI-2 (Chojnacki. & 
Walsh, 1992; Morrison, Edwards, & Weissman, 1994; Tellegen & Ben-Porath, 
1993). The data suggest that, for well-defined profiles with clear separation be- 
tween scale scores, there seems to be about as much congruence between MMPI 


As of 1996, the 10 Wiener-Harmon Subtle-Obvious subscales are no longer available from the MMPI-2 publisher. 
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and MMPI-2 code types as between those obtained from repeated administrations 
of either one of the versions (Archer, 1992b; Graham, 1993). At any rate, the 
MMPI-2 manual has information that permits users to compare scores generated 
from the two versions based on the responses to either one: Although this sugges- 
tion is not without its problems (see, e-g., Ben-Porath & Tellegen, 1995), it has 
been endorsed by some as an empirically defensible method for users to negotiate 
the period of transition between the two versions (see, e.g., Humphrey & 
Dahlstrom, 1995). 


The Minnesota Multiphasic Personality Inventory-Adolescent. The 
MMPLA is the new form of the MMPI developed specifically for use with 
adolescents. It incorporates most features of the MMPI and MMPI-2, including 
all of the 13 basic scales, but accommodates younger test takers through the 
reduction of the overall length of the inventory to only 478 items, the inclusion 
of new items and scales covering areas specifically relevant to them such as school 
and family problems, and, above all, the provision of age-appropriate norms. The 
MMPLA used a normative sample of 1,620 contemporary adolescents between 
the ages of 14 and 18; a clinical sample of 713 adolescents in the same age range 
was collected concurrently for use in comparisons and validity studies. 

In addition to the basic clinical and validity scales it shares with the MMPI-2, 
the MMPI-A has its own validity scales (Fl and F2), as well as some content and 
supplementary scales and subscales that are unique to it and some that are com- 
mon to both instruments. Although a great deal of research, including norms and 
conversion tables published by Dahlstrom et al., (1972) and Marks, Seeman, and 
Haller (1974), supported the use of the MMPI with adolescents, that research 
does not necessarily apply to the MMPI-A, which is more of a brand new instru- 
ment than a revision. As such, its usefulness will have to be determined through 
the accumulation of research and interpretive materials that started concurrently 
with its publication (Archer, 1992a; Butcher & Williams, 1992; Williams, 
Butcher, Ben-Porath, & Graham, 1992). 


Concluding Comments on the Minnesota Multiphasic Personality 
Inventories. In spite of its origin asa prototypical product of a naively applied 
empiricism, and of repeated rumors about its imminent demise, the MMPI has 
managed to survive: In fact, although its original purpose was to aid in the process 
of psychiatric classification, and the procedures followed in its development made 
it unsuitable for personality assessment in normal individuals, the MMPI has 
been widely used with normals as well as psychiatric patients. It and its revised 
versions have incorporated a number of additional procedures and interpretive 
strategies into the original, empirically derived framework of the inventories. 
Such features, which have already been mentioned in passing but deserve further 
Notice, include the many scales developed by grouping items on the basis of vc 
content (Butcher, Graham, Williams, & Ben-Porath, 1990) as well as the use o 
SH analysis in the development of some of the supplementary scales (Welsh, 
956). 
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New ways of approaching the complex task of MMPI interpretation continue 
to evolve. One of the most recent is the use of structural summaries to bring some 
coherence and ease to the use of the multiplicity of intercorrelated scales pro- 
duced by the Minnesota inventories. The approach, based on analyses of scales, 
subscales, and items, aims at reducing the number of dimensions necessary to in- 
terpret the results of the inventories by cutting across arbitrary classifications— 
such as “Supplementary,” “Content,” and “Harris and Lingoes” scales. The most 
salient dimensions are used to organize the categories into a “Structural Sum- 
mary” format similar to the one developed by Exner for use with the Rorschach 
(see chap. 15). This approach to MMPI interpretation is at the incipient stages of 
development and in need of further investigation and validation. Nevertheless, it 
is already being applied both to the MMPI-2, with structural dimensions derived 
primarily on the basis of content analysis, and to the MMPI-A, with dimensions 
derived through factor analysis, and it seems promising (Archer & Krishna- 
murthy, 1994; Archer, Krishnamurthy, & Jacobson, 1994; Nichols & Greene, 
1995). 

Two other sets of developments that proceed apace with the MMPI-2 and 
MMPI-A, as with most other tests, are the computerization of procedures for ad- 
ministration, scoring, and interpretation of the inventories and the development 
of translations of the instruments into many languages. Conventional computer 
administration and scoring of the inventories, which were available for the 
MMPI, are also available for the new inventories, as are several automated inter- 
pretation services. In addition, a computerized adaptive version of the MMPI-2 
has been developed and tried with promising results (Roper, Ben-Porath, & 
Butcher, 1991, 1995). 

Whereas it took almost a decade for the original MMPI to be translated for the 
first time, work on cross-cultural adaptations of the MMPI-2 began even prior to 
its publication. In the first three years of its existence, there were 15 translation 
projects of the MMPI-2 completed or in progress. Six Spanish translations or 
adaptations, including two versions for use with Hispanics in the United States, 


have been prepared. A handbook on translations and international adaptations of 


the MMPI-2 is also available (Butcher, 1996). One noteworthy finding of those 
who have collected data with the use of such translations and adaptations is that 
current cross-cultural normal samples score closer to the MMPI-2 standardization 
norms than earlier such samples did compared to the MMPI norms. 

By trying to improve on the classic example of a personality inventory without 
altering it in any fundamental way, the MMPI Restandardization Committee set 
for itself two difficult and, to a large extent, contradictory goals. Time will tell 
whether the decisions the committee made will extend the supremacy of the 
Minnesota inventories into the next century or whether they will be overtaken 
by a new generation of similar instruments, such as Jackson's Basic Personality In- 
ventory— discussed in a later section of this chapter—or the Personality Assess" 
ment Inventory (PAI) developed by Leslie. Morey (1991) using a sophisticated 
sequential strategy that combined logical and empirical methods to ensure the 
psychometric soundness of its scales. In the meantime, the rate and range of pub- 
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lications of books and articles on the MMPI-2 and MMPI-A seems to be continu- 
ing unabated (Butcher, 1990; Butcher, Graham, & Ben-Porath, 1995; Keller & 
Butcher, 1991; Pope, Butcher, & Seelen, 1993). 


California Psychological Inventory 


Over the years of its existence, the MMPI has served as a basis for the develop- 
ment of other widely used inventories. An outstanding example is the California 
Psychological Inventory (CPI). While drawing nearly half of its items from the 
MMPI, the CPI was developed specifically for use with normal adult populations. 
In its latest revision, the CPI—Third Edition—consists of 434 items to be an- 
swered “True” or "False" and yields scores on 20 scales (Gough & Bradley, 1996). 
Three are "validity" scales designed to assess test-taking attitudes. These scales are 
designated as: Well-being (Wb), based on responses by normals asked to "fake 
bad"; Good impression (Gi), based on responses by normals asked to "fake good"; 
and Communality (Cm), based on a frequency count of highly popular responses. 
The remaining 17 scales provide scores in such personality dimensions as Domi- 
nance, Sociability, Self-acceptance, Responsibility, Socialization, Self-control, 
Achievement-via-conformance; Achievement-via-independence, Empathy, and 
Independence. The last two scales were added in the 1987 revision. 

For 13 of these 17 scales, items were selected on the basis of contrasted group 
tesponses, against such criteria as course grades, social class membership, partici- 
pation in extracurricular activities, and' ratings. The ratings were obtained 
through peer nominations, which have been found to constitute an effective as- 
sessment technique for many interpersonal traits (see chap. 16). For the remain- 
ing 4 scales, items were originally grouped subjectively and then checked for 
internal consistency. Cross-validation of all scales on sizable samples has yielded 
significant group differences, although the overlapping of contrasted criterion 
groups is considerable and criterion correlations are often low. 

As in the MMPI-2, all CPI scores are reported in terms of a standard score scale 
with a mean of 50 and an SD of 10; at present, this scale is based on a normative 
sample of 3,000 females and 3,000 males drawn from the CPI archives so as to 
represent the general population of the United States in terms of age, socioeco- 
nomic level, and geographic area. Norms are provided separately for females, 
males, and both genders combined. In addition, means and SDs of scores on each 
scale are given for many special groups. ? e 

The CPI, originally published in 1956, was conceptualized as an "open system 
from which elements could be removed and to which elements could be added as 
the need arose (Gough, 1987, p. 1). Accordingly, through the revisions it has un- 
dergone, the inventory has been shortened from its original length of 480 items— 
down to 462 in the 1987 revision and most recently to 434 items—to delete those 
that might be objectionable to some respondents or legally questionable in light of 
thé Americans with Disabilities Act of 1990 (PL. 101-336), especially in person- 
nel selection settings. Using the extensive archival data available for more than 
13,000 subjects on all of the CPI items, Gough and Bradley have striven to 
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maintain the reliability and validity of the basic scales by keeping their length 
constant through substitution of deleted items with others that are functionally 
equivalent in terms of the extent of their correlation with the criteria for scale 
membership. 

Research with the CPI has provided a great deal of information that is helpful 
in analyzing profiles based on both individual scale elevations and configurations 
or patterns of scores on two or more scales, in a manner akin to the MMPI’s code 
type interpretation tradition (McAllister, 1996). Cross-cultural studies suggest 
that the CPI is useful in exploring personality differences between ethnic groups 
(see Dana, 1993; Davis, Hoffman, & Nelson, 1990). In addition to the basic 
scales, several others have been developed by various methods. Examples include 
a Managerial Potential Scale and a Work Orientation Scale (Gough, 1984, 
1985). An applications guide has been prepared as an aid to the professional user, 
with special reference to the use of the CPI in personnel selection and career de- 
velopment (Meyer & Davis, 1992). 

Since its 1987 revision, the CPI has also included a three-dimensional typolog- 
ical model for classifying high-scoring and low-scoring individuals on three struc- 
tural or vector scales identified by factor and item analyses. The structural scales 
measure the higher-order dimensions of Internality vs. Externality, Norm- 
acceptance vs. Norm-rejection, and Self-realization. Scores on the first two of 
these scales are used to categorize respondents into four personality types (Alpha, 
Beta, Gamma, and Delta) while scores on the third vector scale purportedly gauge 
level of integration or realization of the positive potential associated with one’s 
type. The typological model is apparently an appealing feature for organizational 
users of the inventory but it has been criticized for the vague description of the 
procedures used to derive it (Engelhard, 1992). In addition, it shares the concep- 
tual and psychometric’ weaknesses inherent in the classification of people into 
types based on arbitrarily dichotomizing one or more continuous dimensions. 


Personality Inventory for Children 


Although not utilizing MMPI items or data, the Personality Inventory for Chil- 
dren (PIC) was constructed with the same general methodology as the MMPI and 
the CPI (Wirt & Lachar, 1981; Wirt, Lachar, Klinedinst, & Seat, 1991). The PIC 
was developed through some 20 years of research by a group of investigators at the 
University of Minnesota who had been thoroughly exposed to the rationale an 

clinical use of the MMPI. It 1s designed for children and adolescents between the 
ages of 3 and 16. A major difference between the PIC and the MMPI pertains to 
the way the information is obtained; the true-false questionnaire items are an- 
swered not by the child but by a knowledgeable adult, usually the mother. This 
procedure is consistent with the common practice in children’s clinics of intet- 
viewing parents as the principal source of information about the child's present: 
ing problems and case history. This inventory, in effect, provides a systematic Way 


"See the discussion of the Myers-Briggs Type Indicator in chapter 16. 
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of gathering such information and of interpreting it in terms of normative and di- 
agnostic data. 

The original PIC comprised a total of 600 items, grouped into three "validity 
scales,” a general screening scale, and 12 clinical scales. The validity scales in- 
clude: the Lie scale, consisting of items that make the child appear in an unrealis- 
tically favorable light; the Frequency scale, comprising rarely endorsed items; and 
the Defensiveness scale, designed to assess parental defensiveness about the 
child’s behavior. The screening scale, namely, Adjustment, is used to identify 
children in need of psychological evaluation. The 12 clinical scales were designed 
to assess the child’s cognitive development and academic achievement, several 
well-established types of emotional and interpersonal problems (e.g., Depression, 
Anxiety, Withdrawal, and Hyperactivity), and the psychological climate of the 
family. 

In the current Revised Format version of the PIC (PIC-R), the items have 
been reordered and reduced in number to 420; the administration booklet groups 
them into three progressively longer parts. Part I (items 1-131) yields scores on 
the Lie scale and four new, broad-based factor scales. Part I (items 132-280) adds 
to Part I shortened versions of the other regular scales and a partial list of critical 
items. Part III (items 281-420) adds the rest of the items needed for the full com- 
plement of 16 original scale scores, four broad factor scores, and the complete set 
of critical items. 

Seven of the original 16 PIC scales were developed by empirical comparisons 
of response frequencies in criterion and control groups; through an iterative pro- 
cedure, items were added in stages until optimum scale validity was reached. The 
other nine scales followed essentially content validation procedures whereby 
items were initially chosen on the basis of judges’ nominations or ratings for scale 
relevance. Even in these cases, however, assessment of internal consistency of 
item responses within scales and factor analyses of items contributed to the con- 
struct validation of the scales. i SN, 

A monograph prepared by Lachar and Gdowski (1979) provides extensive in- 
terpretive data for the original PIC, based on a systematic, comprehensive valida- 
tion study; a manual supplement prepared for the Revised Format version does 
the same for the factor scales and provides the psychometric data on the short- 
ened scales (Lachar, 1982). In addition, Lachar and his associates have used clus- 
ter analyses to classify heterogeneous samples of children on the basis of their PIC 
profiles and have studied the diagnostically significant characteristics associated 
with different PIC profile types. Based on this continuing research, sequential 
tules for profile classification into types and procedures for calculating indices of 
profile similarity have been developed (Gdowski, Lachar, & Kline, 1985; Kline, 
Lachar, & Gdowski, 1992). This actuarial approach to interpretation, which has 
also been used with the MMPI and the CPI, is an extension and elaboration of 
the empirical criterion keying tradition from the development phase of a test to 
its interpretive phase (see, e.g. Kline, Lachar, & Boersma, 1993). à 

It is noteworthy that the PIC-R is not a self-report inventory but an inventory 
of observed behavior. As such, it is consonant with the behavioral assessment 
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otientation to clinical psychology described in chapter 17. However, parental re- 
porting does have limitations that have been generally recognized and noted by 
the authors of the PIC. As they point out, responses may reflect in part the moti- 
vation, attitudes, and personal or cultural standards of the parent. Thus, some in- 
consistencies among the reports of different observers, such as the two parents, 
and between parental reports and the self-reports of their children can be ex- 
pected and are, in fact, found. One way to deal with this unavoidable problem of 
differing perspectives is to assess special response tendencies that may have dis- 
torted the data through the "validity scales.” A second alternative is to gather and 
compare reports from more than one observer. Yet a third avenue is to gather and 
compare self-report and observer report data. 

To this end, the Personality Inventory for Youth (PIY) has been developed as a 
self-report measure that parallels the PIC-R. Although the PIY can be used by it- 
self, ideally, it provides the basis for a more comprehensive assessment using the 
profiles from parents and children conjointly (Lachar & Gruber, 1995a, 1995b). 
The authors of the PIY used the first 280 items of the PIC-R as a point of depar- 
ture for the item pool; most items were transformed from the third- to the first- 
person format with little change (e.g., “My child often brings friends home” 
became “I often bring friends home”). Some items had to be edited more substan- 
tially to conform to their intended meaning; others had to be discarded as inap- 
propriate for self-report at the target age level (which was set at 9 to 18 years of 
age) due to the problems with self-reporting at younger ages. A few items are alto- 
gether new. The final version of the PIY consists of 270 items comprising nine 
nonoverlapping clinical scales, 24 nonoverlapping subscales, and four “validity 
scales.” The first 80 items can be used as an Abbreviated Form for screening pur- 
poses. The PIY was developed and standardized using a sample of 2,337 students 
in regular education classes and a clinical sample of 1,178 children and adoles- 
cents (Lachar & Gruber, 1993). 

Much work remains to be done with both the PIC-R and the PIY. The norms 
of the PIC, gathered in the late 1950s and early 1960s, should be updated, and its 
research base for the preschool age range needs to be expanded (Knoff, 1989). As 
far as the PIY is concerned, its clinical usefulness both by itself and in combina- 
tion with the PIC-R remains to be established. However, these two instruments 
share an impressive empirical foundation and the advantage of providing an inte- 
grated set of multidimensional tools specifically designed for use with children 
and adolescents. 


FACTUR ANALYSIS IN TEST DEVELOPMENT 


Inan effort to arrive av a systematic classification of personality traits, a number of 
psychologists turned to factor analysis. This technique, already discussed in con- 
nection with the organization of cognitive abilities, is ideally suited to the task of 
reducing the number of categories Necessary to account for behavioral phenom- 
ena by looking for consistent patterns in their occurrence. It will be recalled that 
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an element of subjectivity is likely to enter into the identification of factors, since 
the process depends on an examination of measures or items having the highest 
loading on each factor (see chap. 11). Hence, the cross-identification of factors 
from separate investigations using different measures is difficult and has been a 
source of discrepancies in the labels applied to traits as well as in their number. In 
addition, theré are a variety of ways in which factor analysis can be applied to the 
study of personality traits. Of these, two major traditions of research have coex- 
isted for several decades and have, to some extent, coalesced in recent years. 

One of the traditions, centering on the use of personality questionnaire data, is 
exemplified by a series of studies by Guilford and his co-workers (see Guilford, 
1959, chap. 16; Guilford & Zimmerman, 1956). Rather than correlating total 
scores on existing inventories, these investigators computed the intercorrelations 
among individual items from many personality inventories. As a by-product of 
this research, three personality inventories were developed and eventually com- 
bined into the Guilford-Zimmerman Temperament Survey. This inventory yields 
separate scores for ten traits, each score based on 30 different items. Examples of 
these traits include Restraint, Ascendance, Emotional Stability, and Friendliness. 

‘Another of the pioneering applications of factorial methods to the develop- 
ment of personality inventories, sometimes referred to as the “lexical” tradition, is 
to be found in the work that R. B. Cattell started in the 1940s (John, Angleitner, 
& Ostendorf, 1988). In an attempt to obtain a comprehensive description of per- 
sonality, Cattell began by assembling all personality trait names occurring either 
in the dictionary (as compiled by Allport and Odbert, 1936) or in the psychiatric 
and psychological literature. This list of almost 18,000 terms was first reduced by 
combining obvious synonyms. The smaller trait list was then employed in obtain- 
ing associates’ ratings of a heterogeneous group of adults. Intercorrelations and 
factor analyses of these ratings and of self-report questionnaire data led to the 
identification of what Cattell described as “the primary source traits of personal- 
ity,” a designation that seems to imply more universality and stability of results 
than appear justified by the antecedent research. It is characteristic of Cattell’s 
approach that he regards factor analysis not as a data-reduction technique, but as 
a method for discovering underlying, causal traits (Cattell, 1979). 


The Sixteen Personality Factor Questionnaire (16 PF). On the basis of 
their factorial research, Cattell and his co-workers have developed a number of 
personality inventories, of which the best known is the Sixteen. Personality 
Factor Questionnaire, currently in its fifth edition (Cattell, Cattell, & Cattell, 
1993; Conn & Rieke, 1994; Russell & Karol, 1994). Originally published in 1949, 
the 16 PF is designed for ages 16 and over and yields 16 scores in such traits as 
Social Boldness, Dominance, Vigilance, Emotional Stability, and Rule 
Consciousness. The 16 factors, i 
editions of the 16 PE have been refined over the years and renamed, as the 
esoteric terminology that Cattell originally used to designate the traits has been 
largely discarded. For example, the extremes anchoring the dimension now ca ed 
Social Boldness were first labeled “Threctia” and “Parmia,” at the shy and bold 
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ends, respectively. The second-order factors of the 16 PE which previously ranged 
between four and eight in number, are now designated as "global factors" and 
limited to 5, a number that conforms to the popular five-factor model to be 
discussed in the next section. 

The fifth edition of the 16 PF is available in only one form and contains 185 
items, most of which were selected from previous forms of the questionnaire on 
the basis of their content and the quality of their psychometric properties. The 16 
PF has been renormed on a sample of 2,500 individuals selected to represent ap- 
proximately the U.S. population as of the 1990 census in terms of gender, race, 
age distribution, and education. One of the unique features of the 16 PF is the in- 
clusion of 15 items presented contiguously at the end of the inventory under the 
title “Problem Solving Questions”; these items comprise the Reasoning scale, 
which is intended as a quick measure of mental ability. In addition, the question- 
naire now has three indices of response styles designed to assess acquiescence, 
random responding, and attempts to present oneself unrealistically as having ei- 
ther socially desirable or undesirable qualities. 

Internal consistency and test-retest reliability for the 16 primary factor scales 
of the 16 PF are better in the fifth edition than they were in earlier ones. Simi- 
larly, the Technical Manual for this edition contains a great deal more informa- 
tion on validity than did earlier manuals. However, the problem of lack of 
factorial independence of the 16 primary scales, evident in earlier editions, 
seems to persist in this one. This difficulty is underscored by the uniform inabil- 
ity of other investigators who have used Cattell’s original variables to replicate 
anywhere near as many as 16 factors. Instead, most studies using the data on 
which Cattell based his system have found only between 4 and 7 factors (Dig- 
man, 1990; L. R. Goldberg, 1993). The first attempts to replicate Cattell's pri- 
mary factors were made by D. W. Fiske (1949) and resulted in a five-factor 
solution; this work is now widely cited as the earliest version of the contem- 
porary model discussed next. 


The "Five-Factor Model" and Why it Works. Current writings on per- 
sonality assessment have given increasing attention to the so-called Five-Factor 
Model (FFM), which represents an unusual level of consensus among personality 
researchers from the various factor analytic traditions (Costa & Widiger, 1994; 
Digman, 1990; McCrae & John, 1992; Wiggins & Pincus, 1992). At the same 
time, the particular way in. which the model has been. presented has led to 
considerable criticism and controversy. (Block, 1995; Carlson, 1992; Goldberg, 
1993; Kroger & Wood, 1993; Loevinger, 1994).5 Essentially, the Five-Factor 
Model is an attempt to use a hierarchical pattern of analysis in order to simplify 
the vast collection of available data about the affective behavior of individuals. It 
thereby renders the information more manageable in assessing individuals and 
predicting their behavior in given situations. The factors are descriptive rather 


& " A ^ 
ie yap ee following Kroger and Wood in the same Comment section of the American Psychologist, 
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than explanatory; and they are no more basic than the individual items or specific 
tests from which they were derived. 

The Five-Factor Model is similar to the hierarchical structure obtained by fac- 
tor analysis of ability tests discussed in chapter 11. Although several investigators 

| over the past four decades have coincided in arriving at five factors as the number 
needed to account for most of the correlations among a vast amount of personal- 
ity description data, the appropriate number of factors for particular purposes 
could actually be better characterized as 5 +2; and the names given to each factor 
are also controversial (see, e.g., Digman, 1990, p. 423; Loevinger, 1994; Paunonen, 
1993). This is to be expected, since the factors reflect different selections of in- 
ventories, scales, response formats, and samples from whic they were derived. 

Some confusion and misunderstanding have resulted from the way the proce- 
dure for the Five-Factor Model was reported. There is an implication that the 
basic factors of personality have been “discovered” through some new approach. 
Actually, the factors correspond to the second level of a hierarchy resulting from 
factor analyses of personality test and rating scale data. It is noteworthy that the 
second level of the hierarchy, yielding factors of medium breadth, has proved 
most practicable, and replicable, in both cognitive and affective assessment. Be- 
cause they aggregate or combine measures from the lower levels of behavioral de- 
scriptors and of more narrowly defined traits, these factors yield more reliable 
scores—and the resulting validities are correspondingly higher. If a test battery 
has been constructed with a hierarchical model, and if norms are provided for dif- 
ferent levels, as is done in such cognitive batteries as the Differential Ability 
Scales described in chapter 8, the battery can be applied with maximum flexibil- 
ity to suit specific purposes. For example, after identifying the midlevel factor that 
is most characteristic of an individual, the analysis can be supplemented by exam- 
ining that person's scores on the narrower, more detailed tests on the lower level. 

The two investigators most closely, associated with the Five-Factor Model have 
developed a test that fits their version of the model. In its current edition, the Re- 
vised NEO Personality Inventory" (NEO PI-R—Costa & McCrae, 1992b) pro- 
vides scores on five major dimensions, Or domains, of personality and on 30 
additional traits, or facets, that identify each domain. Costa and McCrae avoid 
the use of the term “factors” to designate scales or components that might apply 
to any level of the hierarchy. The five major domains—Neuroticism (N), Extra- 
version (E), Openness to, Experience (O), Agreeableness (A), and Conscien- 
tiousness (C)——and their respective facets are listed in Table 13-1. 

The scales of the NEO PI-R were developed over 15 years of research. that 
started with longitudinal studies of aging in normal adult samples and was later 
extended to clinical, employment, and college samples. Although it was de- 
signed as a measure of “normal personality traits,” Costa and McCrae intend for 
the instrument to be useful in clinical and other applied settings, as well as in re- 
search. Among the methodological innovations introduced in the NEO PIR is 


°The letters NEO stand for Neuroticism, Extraversion, and Openness to Experience, but these words themselves 


are not used in the full title of the inventory: 
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Domains and Facets of the Revised NEO Personality Inventory 
(NEO PI-R) i 


TRAV OPENNESS TO EXPERIENCE 
 Gregariousness y Aesthetics 
„Assertiveness (E3) Feelings (O3) 

Activity (E4) ` Actions (04) 

. Excitement-Seeking (ES) Ideas (O5) 

Mura Emotions (E6) Values (O6) 


Bus -. CONSCIENTIOUSNESS 
I : Competence (C1) 

(A2) Order (C2) 

(0007 77 — "7 Dutifulness (C3) 

M "^ 07 Achievement (C4) 
ear Self-Discipline (C5) 
coo TenderMindedness(A6) Deliberation (Có) 
(Adapted from Costa & McCrae, 1992b, 
Reproduced by permission.) 


p. 2. Copyright € 1992 by Psychological Assessment Resources, Inc. 


the availability of a self-report form (Form S) and two versions of an observer- 
report form (Form R-Men and Form R-Women) that contain the same 240 items 
as Form S stated in the third person; Form R allows the possibility of obtaining 
independent ratings from peers, spouses, and others on the same domains as the 
self-ratings. This is especially important in the case of the NEO PLR because the 
inventory assumes an honest and cooperative test taker and contains no scales 
designed to check the veracity of responses. Norms for adult men and women are 
available for both forms, and norms for college-age men and women are provided 
for Form S. 

The Five-Factor (or “Big Five”) Model as postulated by McCrae and Costa has 
been widely, though by no means universally, accepted as a useful framework for 
the exploration of personality traits. Differences exist even among factor analyti- 
cally oriented investigators both in the number of intermediate-level factors they 
propose and in their definitions (L. R. Goldberg, 1993; Zuckerman, Kuhlman, 
Joireman, Teta, & Kraft, 1993). Nevertheless, in its various versions, the model 
has spurred a flurry of research activity aimed at the cross-identification of factors 
and integration of various perspectives, such as the significant aspects of the nor- 
mal and pathological personalities (see, e.g., Hofstee; de Raad, & Goldberg, 
1992). Related work in test development and refinement of existing scales is also 
proceeding apace (Costa & McCrae, 1994, 1995; Costa & Widiger, 1994; Hark- 
ness, McNulty, & Ben-Porath, 1995; Hogan & Hogan, 1992). 

In evaluating the results of this activity, one should keep in mind that factor 
analysis merely provides a technique for grouping items into relatively homoge- 
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neous and independent clusters. Such groupings facilitate the investigation of va- 
lidity against empirical criteria, permit a more efficient combination of scores for 
the prediction of specific criteria, and contribute toward construct definition. Ho- 
mogeneity and factorial purity are desirable goals in test construction. But they 
are not substitutes for empirical validation or sound theoretical grounding. 


PERSONALITY THEORY IN TEST DEVELOPMENT 


Personality theories have usually originated in clinical settings. The amount of 
experimental verification to which they have subsequently been subjected varies 
tremendously from one theoretical system to another. Regardless of the extent of 
such objective verification, a number of personality tests have been constructed 
within the framework of one or another personality theory. Clinically formulated 
hypotheses have been especially prominent in the development of projective 
techniques, considered in chapter 15. While this approach to test construction 
has been followed less often for self-report inventories, a few outstanding exam- 
ples are available. 3 à 


Millon Clinical Multiaxial Inventory: Although following the MMPI tra- 
dition in several ways and designed for the same purposes, the Millon Clinical 
Multiaxial Inventory-lll (MCMLIII—Millon, Millon, & Davis, 1994), originally 
published in 1977, introduces significant methodological innovations. In fact, its 
development was deliberately undertaken to meet the criticisms of the MMPI and 
to utilize intervening advances in the diagnosis of psychopathology and test 
construction. 

The MCMLIII is grounded in Millon's biopsychosocial views of personality 
functioning and psychopathology (Millon, 1969, 1981, 1990; Millon et al, 1996). 
His theory encompasses a matrix of personality styles derived from combining 
types along two dimensions, namely, source of reinforcement (i.e., detached, dis- 
cordant, dependent, independent, and ambivalent) and pattern of coping behav- 
ior (i.e., active or passive). Millon's theory of personality styles served as one of 
the conceptual bases in the original formulation of the Axis II personality disor- 
ders categories of the. Diagnostic and Statistical Manual of Mental Disorders—III 
(DSM-III—1980) prepared by the American Psychiatric Association, now in its 
fourth edition (DSM-IV—1994). In turn, the clinical scales of the MCMLIII are 
consistent with, though not identical to, the classificatory system followed in the 
DSM-IV. Efforts to coordinate the inventory as closely as possible with the evolv- 
ing framework of the DSMs have, in fact, prompted the frequent revisions that 
the MCMI has undergone. ^ 

The MCMLIII contains 175 brief, self-descriptive statements to be marked 
"true" or "false" by the respondent. The score profile includes 24 clinical scales, 
each based on 12 to 24 overlapping items that often appear in as many as three 
different scales, albeit with different weights; items that meet all of the validation 


criteria for their "home" scale are given a weight of 2, whereas subsidiary items 
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recelve a weight of 1. Ascan be seen in Table 13-2, the clinical scales are grouped 
—— namely, Clinical Personality Patterns, Severe Person- 


Pathology, Clinical Pe and. Severe Syndromes. The first two of 
oa scales designed to assess the enduring Axis I personality 
a" of the DSM, at different levels of severity. The other two ens 
compass some of the Axis I syndromes of the DSM. There are also three modify- 
ing índices and a Kail check des zue d to detect atypical response patterns and 
SSC The MOMI originally could be scored only by computer; cur- 
to mail-in scoring services and software for generating profiles 
MN hand scoring is available, although it is quite laborious 
because of ‘to incorporate score transformations and several adjustments. 
.. One of the most: t innovations introduced by the MCMI is the use of 
standard scores Base Rate (BR) scores which, instead of being normalized, 
are anchored to the prevalence rates of the characteristics being measured. The 
cutoff BR scores of the Zn are = reflect actuarial base rate data 
within populations particular conditions the scales assess and 
d diagnosis. Because prevalence rates may vary in differ- 
ent clinical populations and settings, the BR scores of some of the MCMI-II 
sal can be adjusted for setting, chronicity, and scores on anxiety and depres- 
us as well as certain response patterns. 
Item development for the MCMI followed the multifaceted approach charac- 
teristic of recent practice in the construction and validation of personality inven- 


h Scales of the Millon Clinical Dee có dia 


This ben Millen et d. 1994, p. MÀ INC Ripsdusid w permission.) 
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Personality Inventory (MAPI—Millon, Green, & Meagher, 1982), an older in- 
strument originally designed for use in both clinical assessment as well as voca- 
tional counseling and academic advising, with scales that assess the basic 
personality styles, expressed concerns, and behavioral tendencies of adoles- 
cents.!! The Millon Index of Personality Styles (MIPS—Millon, 1994), on the 
other hand, is intended as a measure of personality for normal adults who seek as- 
sistance for work, family, or social problems in various counseling settings. The 
MIPS was standardized on adult and college samples and combines elements of 
Millon’s personality theory with those of Freud and Jung. 


Edwards Persona! Preference Schedule. | Among the personality theories 
that have stimulated test development, one of the most prolific has been the 
manifest need system proposed by Murray and his associates at the Harvard 
Psychological Clinic (Murray et al., 1938). One of the first inventories designed 
to assess the strength of such needs was the Edwards Personal Preference Sched- 
ule (EPPS—Edwards, 1959). Beginning with 15 needs drawn from Murray’s list; 
Edwards prepared sets of items whose content appeared to fit each of these needs. 
Examples include the need for Achievement (to do one’s best and accomplish 
something difficult), Deference (to conform to what is expected of one), 
Exhibition (to be the center of attention), Intraception (to analyze the motives 
and feelings of oneself and others), Dominance (to influence others and to be 
regarded as a leader), and Nurturance (to help others in trouble). 

The inventory consists of 210 pairs of statements in which items from each of 
the 15 scales are paired with items from the other 14.14 Within each pair, the test 
takers choose one statement as more characteristic of themselves. It is important 
to bear in mind that, due to this forced choice, the EPPS results in ipsative 
scores—that is, the strength of each need is expressed not in absolute terms, but 
in relation to the strength of the individual’s other needs. The frame of reference 
in ipsative scoring is the individual rather than the normative sample. Because 
the sum of all the subscale scores is a constant for all test takers, if a person’s score 
on one subscale moves one point up, another subscale score must move one point 
down. Under these conditions, two individuals with identical scores on the EPPS 
may differ markedly in the absolute strength of their needs. Although the EPPS 
provides norms for conversion of its scores to percentiles, the advisability of con- 
version may be questioned because of the ipsative nature of the scores. While the 
ipsative frame of reference may be the most suitable for intraindividual compat- 
isons, such as those needed in the assessment of interests and other preferences, 
normative reference data are necessary for the sort of interindividual comparisons 
used, for example, in the assessment of abnormality (Fedorak & Coles, 1979). 


n 5 
The ns is now recommended as a tool for use in nonclinical assessments of personality in normal adoles- 
cents only. 


This ; à 
"This item form, which represents an important feature of the EPPS, will be discussed further in the next sec- 
tion, as an example of the forced-choice technique. 
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However, the combination of both frames of reference makes the interpretation 
of scores confusing and less meaningful than would be the case with a consistently 
ipsative or consistently normative approach. , 

Although the validity data reported in the EPPS manual are meager, a large 
number of independent validation studies have been published. The results of 
these validation studies, however, are often difficult to interpret because most of 
them have failed to take into account the ipsative nature of the scores. With ip- 
sative scores, the mean intercorrelation of individual scales tends to be negative 
and the mean correlation of all the scales with any outside variable will approach 
zero (Hicks, 1970). Owing to these artificial constraints, ipsative scores cannot be 
properly analyzed by the usual correlational procedures. It is not surprising, there- 
fore, that the published validation studies have yielded conflicting and inconclu- 
sive results (see, e:g: Piedmont, McCrae; & Costa, 1992). In spite of its simplicity 
and noteworthy features, the EPPS is in need of revision to eliminate technical 
weaknesses related to item form and score interpretation. 


Personality Research Form and Other Jackson Inventories. The 
Personality Research Form (PRF) reflects many technical advances in test con- 
struction, including some item-selection procedures that would have been virtu- 
ally impossible before the availability of high-speed computers. The PRF exempli- 
fies Douglas N. Jackson’s approach to personality test development, which begins 
with explicit, detailed descriptions of the constructs to be measured. These des- 
criptions form the basis for item writing as well as for defining the traits to be 
rated by judges in the validation studies (Jackson, 1970, 1989b). 

The PRE is available in five different options, including two sets of parallel 
forms (A,B and AA,BB) of 300 and 440 items, respectively. The longer forms 
provide 22 scale scores each based on 20 items, including two validity scores, In- 
frequency and Desirability; the shorter forms have only fifteen 20-item scales. An 
additional version (Form E), developed later through the use of more sophisti- 
cated item analytic techniques, consists of 352 of the best items from the longer 
forms and provides scores on all 22 scales, although each contains only 16 items. 
Form E, which is now the most frequently used, also has an easier vocabulary level 
than the others. The Infrequency score, designed as an index of carelessness, fail- 
ure to understand directions, and other nonpurposeful responding, is based on the 
number of highly unlikely responses chosen by the test taker; examples include “I 
try to get at least some sleep every night” and “I make all my own clothes an: 
shoes.” Although desirability bias was substantially reduced in advance by the 
procedures employed in item development and selection, a Desirability scale is 
also provided. The manual correctly observes that unusually high or low scores on 
that scale may indicate not only atypical test-taking attitudes (e.g. deliberate at- 
tempt to create a favorable impression vs. malingering) but also important p 
sonality characteristics in their own right (e.g. high self-regard or high degree o! 
conventional socialization vs. low self-regard). f ' 

Like several other personality instruments, the PRF took Murray i mo el 
theory as its starting point. Drawing upon the extensive research and theoretica 


literature that had accumulated during three decades, Jackson formulated behav- 
iorally oriented and mutually exclusive definitions of 20 personality constructs or 
traits. Of these, 12 have the same names as those in the EPPS. For each of the 20 
b personality scales; the manual provides a description of high scorers and a 
set of defining trait adjectives. Two illustrative scale definitions are shown in 
Habledasgev et, i 
~ Through carefully controlled: procedures; pools of more than 100 items were 
. generated for each scale. Twenty items were then selected for each scale based on 
high biserial correlation with total scale score and low correlation with scores on 
other trait scales-and on the Desirability scale. Items yielding extreme endorse- 
ment proportions were eliminated. Through a specially developed computer pro- 
gram, items were assigned to the parallel forms in terms of biserial correlation with 
their own scale, as well as endorsement frequency. The procedures used in the con- 
struction of Form E included the calculation of an item efficiency index for each 
item. Based on weights derived from various statistical item parameters, the index 
allows each item to be ranked within its scale on the basis of its efficiency. 
~ The construct validity of the PRF depends to a large extent on the procedures 
followed in the development and selection of items for each scale. Subsequent 
factorial analyses corroborated the grouping of items into the 20 scales. Correla- 
tions with comparable scales in the California Psychological Inventory, Guilford- 
Zimmerman Temperament Survey, and NEO-PI, among others, provide additional 


Kai 


. Examples of Scale Definitions from Personality Research Form 
i Scale. à of High Scorer Defining Trait Adjectives 


^ precise, exacting, definite, " 

EC certainty, meticulous, — 

ene s perfectionistic, clarifying, 

licit, accurate, rigorous, ` 
ll, avoids ambiguity, 


Nes y, aware, notices 
© "environment, feeling, sensi“! 
tive, sensuous, open to ex- ` 
perience, perceptive, 
` noticing, discriminating, 


to impressions. 


Pic 


From Jackson, 1989b, pp. 6-7. Copyright © 1989 by Douglas N. Jackson. Reproduced by permission.) 


D 


Self-Report Personality Inventories 387 


support for the identification of traits. Although studies of the correlations be- 
tween various instruments designed to measure Murray's needs—such as the PRE 
EPPS, TAT (chap. 15), and ACL (chap. 16)—have yielded inconsistent results, 
there seems to be more support for the construct validity of the PRF than for some 
of the other tools; this would not be surprising in light of the careful attention 
given to the formulation of traits in the development of the PRF (Costa & Mc- 
Crae, 1988; D. W. Fiske, 1973; Rezmovic & Rezmovic, 1980). Data on the empir- 
ical validity of the PRF against. pooled peer ratings and self-ratings have also 
yielded promising results. It is noteworthy that. the PRF has proven to be applica- 
ble across several cultures, even in non-Western ones (Jackson, Guthrie, Astilla, 
& Elwood, 1983; Paunonen, Jackson, Trzebinski, & Forsterling, 1992). On the 
whole, as the extensive annotated bibliography compiled by MacLennan (1992) 
attests, the PRF is an excellent research instrument, but more information is still 
needed to determine its effectiveness in practical situations. 

The Jackson Personality Inventory-Revised (JPI-R), developed after the PRF 
through similar though more refined scale construction procedures, has a more 
practical orientation (Jackson, 1976, 1994a). The trait scales were chosen from the 
personality and social psychological literature partly because of their relevance to 
the prediction of behavior of normal subjects in a variety of contexts. Among the 
traits covered by the 15 scales are anxiety, cooperativeness, responsibility, social 
astuteness, and tolerance, Validity data were gathered not only through. correla- 
tions with peer ratings and self-ratings, using the multitrait-multimethod matrix 
model, but also through studies of particular groups for whom relevant behavioral 
data in real-life contexts were available. In the recent revision of this inventory, 
the norms for college students were updated and new norms for blue- and white- 
collar workers were developed to support its use in academic counseling and em- 
ployment settings. In addition, a number of technical modifications and minor 
changes were made on the scales, which are now grouped into five higher-order 
clusters that are largely compatible with the Five Factor Model (FFM) categories 
discussed in an earlier section. Specifically, the JPI-R5 Extroyerted, Dependable, 
and Analytical clusters are substantially similar to the Extraversion, Conscien- 
tiousness, and Openness to Experience dimensions of the FFM, respectively. In 
addition, the Emotional cluster of the JPI-R seems to represent a combination of 
the FFM's Neuroticism and Agreeableness. However, the Opportunistic cluster of 
the JPI-R, which draws from its Social Astuteness and Risk Taking scales, has no 
direct link to the FFM and is thus seen as a unique factor. 

Turning to the assessment of psychopathology, Jackson has employed the same 
stringent standards applied in the PRE and JPI to the construction of the Basic 
Personality Inventory (BPI—]Jackson, 19892). The BPI, which was developed over 
a period of about 15 years, aims to recreate the diagnostic efficiency associated 
with the MMPI, using scales that are more substantial in terms of content, psy- 
chometric purity, and range of applicability. Although the BPI is in need of more 


UThis instrument has been extensively reviewed in the Mental Measurements Yearbooks: 7th MMY, #123; 8th 
MMY, #643; 10th MMY, #282; see also TIP-IV for an updated reference list. 
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representative norms, especially for adults, it has already shown promise for clini- 
cal use in the area of juvenile delinquency (Holden & Jackson, 1992; for review, 
see Urbina, 1995). 

In this overview of self-report inventories, the reader has probably noticed an 
increasing tendency to combine different approaches. This is particularly true of 
the Jackson inventories, and of other new inventories developed in the past 
decade, which employ all strategies except the empirical criterion keying of items. 
In addition, even the Minnesota inventories, which are the preeminent examples 
of empirical criterion keying, now contain scales based on content and factor 
analysis. There is some evidence suggesting that personality inventory scales con- 
structed by any of the four major methods outlined in this chapter can all be ef- 
fective, at least with regard to convergent and predictive validity (Burisch, 1986). 
However, the content- and theory-based procedures are easier to construct, more 
efficient to use, and more likely to show discriminant validity than the empirical 
criterion keying approach, which in turn is the one that is least compatible with 
the rest. Furthermore, there is ample agreement to the effect that: (a) the devel- 
opment of inventories should start with an explicit definition of the trait or con- 
struct to be measured; and (b) the organizational structure outlined by Campbell 
and Fiske in their multitrait-multimethod matrix (see chap. 5) provides an opti- 
mal strategy for the examination of the construct validity of personality invento- 
es iar John, & Lóhr, 1986; Hogan & Nicholson, 1988; Ozer & Reise, 


TEST-TAKING ATTITUDES AND RESPONSE BIASES 


Faking and Social Desirability. Self-report inventories are especially 
subject to the possibility of deliberate misrepresentation. Despite introductory 
remarks to the contrary, most items on such inventories have one answer that is 
recognizable as socially more desirable or acceptable than the others. On such 
tests, respondents may be motivated to “fake good,” or choose answers that create 
a favorable impression, as when applying for a job or seeking admission to an 
educational institution. Under other circumstances, respondents may be moti- 
vated to “fake bad,” thus making themselves appear more psychologically 
disturbed than they are. This may occur, for example, in the testing of persons on 
trial for a criminal offense. 

3 Evidence of the success with which respondents can dissemble on personality 
inventories is plentiful (see, e.g., Jacobs & Barron, 1968; Radcliffe, 1966; Stricker, 
1969; J. S. Wiggins, 1966). A common classroom demonstration consists in ask- 
ing different groups to assume specified roles. For example, one section of the class 
is directed to answer each question as it would be answered by a happy and well- 
adjusted college student; another section is told to respond in the manner of a se 
verely maladjusted person; and in the last section, respondents are instructed to 
answer the items truthfully with reference to their own behavior. Or the same 
Persons may take the test twice, first with instructions to simulate in a specified 
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way and later under ordinary conditions. The results of such studies clearly 
demonstrate the facility with which the desired impression can be deliberately 
created. It is interesting to note that even specific simulation for a particular vo- 
cational objective can also be successfully carried out (Wesman, 1952). Current 
research indicates that, especially on self-report inventories, the “face validity"! 
of an item increases its susceptibility to faking, under both realistic and controlled 
laboratory conditions. The more readily respondents can identify the trait being 
assessed from the item wording, the more frequently they give the desirable re- 
sponse (Bornstein et al., 1994). 

The tendency to choose socially desirable responses on a self-report inventory 
need not indicate deliberate deception on the part of the respondent. A. L. Ed- 
wards (1957), who first investigated the social desirability variable, conceptual- 
ized it primarily as a fagade effect, or tendency to “put up a good front,” of which 
the respondent is largely unaware. This tendency may indicate lack of insight 
into one's characteristics, self-deception, or an unwillingness to face up to one's 
limitations. Other investigators (Crowne & Marlowe, 1964; N. Frederiksen, 
1965) have presented evidence to suggest that the strength of the social desirabil- 
ity response set is related to the individual's more general need for self-protection, 
avoidance of criticism, social conformity, and social approval. On the other hand, 
the individual who chooses unfavorable items in a self-description may be moti- 
vated by a need for attention, sympathy, or help in meeting personal problems. 
Persons seeking psychotherapy, for example, are likely to make themselves appear 
more maladjusted on a personality inventory than they actually are. 

Furthermore, it cannot be assumed that basic research is free from the effects of 
response sets. In the context of studies on attitude change, for example, investiga- 
tors have shown that results may be influenced by such conditions as the partici- 
pant’s perception of what the experimenter expects, the desire to protect ones 
own image, and the wish to please or frustrate the experimenter (Silverman & 
Shulman, 1970). Unsuspected differences in these response sets may account in 
part for failures to replicate results when the experiments are repeated. 

Some investigators (Paulhus 1984, 1986; Paulhus & Reid, 1991) have under- 
scored the difference between the notions of impression management and self- 
deception as explanations for socially desirable—or undesirable— responding. 
Impression management refers to conscious dissembling designed to create a Spe 
cific effect desired by the respondent. It is seen as a contaminant of self-report 
data and as something that must in itself be assessed and, whenever possible, min- 
imized or contained. On the other hand, self-deception, which consists of usually 


positively biased responding that the test taker actually believes to be true, is a 


great deal more complex. Self-deception is linked to other concepts pertaining to 
oana lytic notion 


the self, such as self-image and self-esteem, as well as to the psych à seh 
of defense mechanisms. Thus, it is a variable worth studying in its own right as 


14See chapter 5. 


Pj ` D 
"Manipulation" and "other-deception" are sometimes used instead of "impression management. 
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possibly indicative of good adjustment, up to a point, and predictive of other in- 
dependent criteria. It has, for example, been noted that some self-report scales 
may give results that show "illusory mental health" on the part of defensive de- 
niers who strive to maintain a belief in their own adjustment (see, e.g., Shedler, 
Mayman, & Manis, 1993). Therefore, the relationship between self-deception 
and adjustment seems to be neither simple nor direct. To complicate things even 
further, it appears that certain linguistic characteristics of items interact with re- | 
spondent variables to produce biased responses (Helfrich, 1986). 

Several procedures have been followed in the effort to meet the problem of 
faking and related response sets in personality inventories, The construction of 
relatively “subtle” or socially neutral items may reduce the operation of these fac- 
tors in some inventories; however, as Jackson (1971) pointed out, often those 
items also have low validity for the dimension under consideration. Ideally, and 
in fact in a number of situations, the test instructions and the establishment of 
tapport should motivate test takers to respond frankly, if they are convinced that 
it is to their advantage to do so. This approach is ineffective in certain situations, 
however; and it probably does not have much effect on social desirability re- 
sponse sets of which the individual is unaware. 

Another approach to the evaluation of socially desirable responding and other 
forms of impression management is to construct special scales—which can either 
be embedded in an inventory or administered by themselves in a battery of 
tests—for that purpose. One of the earliest such scales was Edwards' (1957) social 
desirability (SD) scale, developed by selecting items on the basis of the agreement 
of judges regarding their extremely high or low social desirability values. Others, 
such as the CPI's Good Impression scale and LS Wiggins’ (1959) SD scale, have 
selected items on the basis of differences in endorsement frequencies by subjects 
responding under "fake good" conditions and those responding under normal 
conditions, A third method is exemplified by the MMPT' Lie scale, which con- 
tains items written so that they will be answered in the socially desirable direction 
only by respondents who exhibit an unrealistic positive bias. Other measures 
aimed. specifically at the detection of malingering and of careless or random re- 
sponding, such as the F scale of the MMPI, have been devised.'® Still another 


procedure, directed not to the detection but to the prevention of dissimulation, is 
the use of forced-choice items. 


RUN Technique. Essentially, the forced-choice technique requires 
€ respondent to choose between two descriptive terms or phrases that appear 
equally acceptable but differ in validity. The paired phrases may both be desirable 


GC 4 choice between two contrasting responses within the same trait, scored 
or a single scale. Although rarely employed in personality inventories, this item 
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For a thorough review of scales for the assessment of malingering, see Berry, Wetter, and Baer (1995). 
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form has the advantage of yielding normative rather than ipsative scores and 
hence imposing no artificial constraints on the interrelationships among different 
scales. An example is provided by the Myers-Briggs Type Indicator, discussed in 
chapter 16. 3 ; 

Use of the forced-choice technique to control social desirability requires two 
types of information regarding each response alternative—namely, its social de- 
sirability or “preference index” and its validity or “discriminative index.” The lat- 
ter may be determined on the basis of any specific criterion the inventory is 
designed to predict, such as academic achievement or success on a particular kind 
of job; or it may be based on the factor loading of items or their theoretical rele- 
vance to different traits. Social desirability can be found by having the items rated 
for this variable by a representative group, or by ascertaining the frequency with 
which the item is endorsed in self-descriptions. It has been shown that frequency 
of choice and judged social desirability correlate very highly (Edwards, 1957). In 
other words, the average self-description of a population agrees closely with its av- 
erage description of a desirable personality. 

Although the influence of social desirability may be reduced in forced-choice 
items, it cannot be assumed that it is completely eliminated. When items of the 
EPPS were presented in a free-choice format, the scores correlated quite highly 
with those obtained on the forced-choice format of the same test (Lanyon, 1966). 
Furthermore, the judged social desirability of particular items is not constant for 
all purposes but may differ for different occupations. Thus, a forced-choice test 
whose items were equated in general social desirability could still be faked when 
taken by job applicants, candidates for admission to professional schools, and 
other specifically oriented groups. From another angle, it has been found that 
when items are paired on the basis of average group judgments of general social 
desirability, they may be far from equated for individuals (N. Wiggins, 1966). 

In conclusion, it appears that the forced-choice technique has not proved as ef- 
fective as had been anticipated in controlling faking or social desirability response 
sets. At the same time, the forced-choice item format, particularly when it yields 
ipsative scores, introduces other technical difficulties and eliminates information 
about absolute strength of individual characteristics that may be of prime impor- 
tance in some testing situations. 


Response Sets and Response Styles. The tendency to choose response 
alternatives on the basis of social desirability is only one of several response sets that 
have been identified in self-report inventory responding (Lanyon & kënen 
1982, pp. 158-169). Although the voluminous literature on. the operation o 

response sets in personality inventories dates largely from the 1950s, the vs 
of response sets in both ability and personality tests Was observed by ear A 
investigators (see Block, 1965, chap. 2). One of the response sets that E d 
early attention was acquiescence, Or the tendency to answer es or es. 

Acquiescence is conceptualized as a continuous variable; at one end of ae e 
the consistent "Yeasayers" and at the other end the consistent *Naysayers" ( Sou à 
& Keniston, 1960). The implications of this response set for the construction o 
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personality inventories is that the number of items in which a "Yes" or "True" 
response is keyed positively in-any trait scale should equal the number of items in 
which a “No” or “False” response is keyed positively. This balance can be 
achieved by the proper selection or rewording of items, as was done in the PRF 
and is now being done with most new inventories.! 

- Another response set is deviation, or the tendency to give unusual or uncom- 
mon responses. Berg (1967) proposed this hypothesis and demonstrated its opera- 
tion with nonverbal content in a specially developed test requiring an expression 
of preference for geometric figures. Scales made up of items likely to be answered 
in one direction by almost all test takers, such as the Infrequency scale of Jack- 
son's PRF, were intended to identify such deviant response patterns. However, 
Jackson himself, among others, has pointed out that these scales tend to lack con- 
ceptual relevance to external criteria and, therefore, pose a problem especially in 
contexts like employment settings where the relevance of questionnaire items is 
considered important. Because of this, the Infrequency scale of the JPI was re- 
moved when that inventory was revised (Jackson, 1994a). The tendency to use 
the extreme choices on a rating scale (e.g., Is and 7s on a seven-point scale) has 
also been identified as a possible response bias (Paulhus, 1991). 

Research on response sets such as social desirability, acquiescence, and devia- 
tion has passed through several stages. When first identified, response sets were 
regarded as a source of irrelevant or error variance to be eliminated from test 
scores. Later, these response sets came to be regarded as indicators of broad and 
durable personality characteristics that were worth measuring in their own right 
(Jackson & Messick, 1958, 1962; J. S. Wiggins, 1962). At this stage, they were 
commonly described as response styles and an elaborate edifice of empirical data 
was built around them. Eventually, these data were challenged from many direc- 
tions (Block, 1965; Heilbrun, 1964; Rorer, 1965). Block (1965), for example, pre- 
sented strong evidence supporting a content-oriented interpretation of the two 
major factors generally found to account for most of the common variance in the 
MMPI scales, which exponents of response sets and response styles had inter- 
preted as social desirability and acquiescence. 

The controversy over response sets and content-versus-style in personality as- 
sessment has never been fully settled ( Edwards, 1990; Hogan & Nicholson, 1988; 
Jackson & Paunonen, 1980).!5 The majority of test developers and investigators 
seem to agree that personality inventory scores are likely to reflect a combination 
of self-deception, impression management, and realistic self-portrayal and that 
the weight of each of these components will vary with the individual and the oc- 
casion. Some, however, view attempts to improve the trustworthiness of self- 
report data through special scales and items as possibly counterproductive in that 
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they may reduce the validity of scales especially for normal, as opposed to patho- 
logical, samples. Such authors advocate the use of elinical skills in eliciting a pa- 
tient’s cooperation and in interpreting scores, as well as the inclusion of ratings 
from knowledgeable informants whenever there is reason to suspect serious dis- 
tortion (see, e.g., Costa & McCrae, 1992a). Í a 

Most other workers, especially those involved in the assessment of psy- 
chopathology, continue to use so-called "validity" scales, with the awareness that 
they may also reflect personality styles and characteristics. In fact, some of the 
newest and. technically more advanced instruments for the assessment of psy- 
chopathology, such as; Jackson's. BPI and Morey’s PAI, use balanced keying of 
items and special scales for detecting invalidating response sets. There is also a 
new variety of such scales, exemplified by the VRIN and TRIN scales of the 
MMPI-2 and MMPI-A, that make use of specially selected item pairs that are ei- 
ther similar or opposite in content to detect inconsistent or contradictory re- 
sponding. Because of the way, they are constituted, the VRIN and TRIN scales, 
which are similar to Greene’s (1978) Carelessness scale for the original MMPI, 
are not likely to be confounded. by valid personality trait variance (Ozer & Reise, 
1994). 

At any rate, the argument about response sets and styles has stimulated exten- 
sive research and has produced several hundred publications. Like many scientific 
controversies, its net effect has been to sharpen our understanding of method- 
ological problems and thereby improve the construction of personality invento- 
ries and their use in both research and applied settings: 


TRAITS, STATES, PERSONS, AND SITUATIONS 


Interactions of Persons and Situations. A long-standing controversy re- 
garding the generalizability of psychological traits versus the situational specificity 
of behavior reached a peak in the late 1960s and the 1970s. Several developments 
in the 1960s focused attention on narrowly defined “behaviors of interest” and 
away from broadly defined traits. In the area of abilities, this focus is illustrated by 
individualized instructional programs and domain-réferenced testing (chap. 3) 
and by the diagnosis and treatment of learning disabilities (chap. 17). In the 
personality realm, the strongest impetus toward behavioral specificity in testing 
came from the social learning and social cognitive theories that underlie behavior 
modification and behavior therapy (Bandura, 1969, 1986; Goldfried & Kent, 
1972; Mischel, 1968, 1969, 1973). Criticism was directed especially toward the 
early view of traits as fixed, unchanging, underlying causal entities. This kind of 
criticism had been anticipated in the earlier research and writing of several 
psychologists, with regard to all traits—cognitive as well as noncognitive (see 
chap. 11). Although few psychologists had really argued for this extreme Ui of 
traits, during the heyday of the situational specificity argument it was qe to S 
any who would identify herself or himself as a "trait theorist" (Jackson E 


Paunonen, 1980). 
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Situational specificity in particular is much more characteristic of personality 
traits than it is of abilities. For example, a person might be quite sociable and out- 
going at the office, but shy and reserved at social gatherings. Or a student who 
cheats on examinations might be scrupulously honest in money matters. An ex- 
tensive body of empirical evidence was assembled by Mischel (1968) and D. Pe- 
terson (1968) showing that individuals do exhibit considerable situational 
specificity in many nonintellective dimensions, such as aggression, social confor- 
mity, dependency, rigidity, honesty, and attitudes toward authority. Part of the ex- 
planation for the higher cross-situational consistency of cognitive than of 
noncognitive functions may be found in the greater standardization of the indi- 
vidual's reactional biography in the intellectual than in the personality domain 
(Anastasi, 1958, chap. 11; 1970, 19832). The formal school curriculum, for exam- 
ple, contributes to the development of broadly applicable cognitive skills in the 
verbal and numerical areas. Personality development, on the other hand, occurs 
under far less uniform conditions. In the personality domain, moreover, the same 
response may lead to social consequences that are positively reinforcing in one 
situation and negatively reinforcing in another. The individual may thus learn to 
respond in quite different ways in different contexts. Such dissimilarities in expe- 
riential history, across individuals as well as across situations and cultures, also 
lead to greater ambiguity in personality test items than is found in cognitive test 
items. Thus, the same response to a given question on a personality inventory— 
which can in itself be construed as a "situation"— may have different significance 
from one person to another. 

It should be noted that the trait-situation question is also related to the famil- 
iar heredity-environment question (D. C. Rowe, 1987). Hereditary influences are 
likely to be manifested through relatively enduring individual traits, which may 
nevertheless include a trait of adaptability to situational demands. Environmental 
influences may contribute to situational variance (or specificity) as well as to trait 
stability, because the individual's environment may itself exhibit major consisten- 
cies over time and situations. Appropriate experimental designs with repeated 
measurements obtained longitudinally and cross-situationally should enhance our 
understanding of both the heredity-environment and the trait-situation questions. 

Both the theoretical discussions and the research on person-by-situation inter- 
action have undoubtedly enriched our understanding of the many conditions that 
determine individual behavior, and they have contributed to the development of 
sophisticated research designs. Concurrently, there has been a growing consensus 
among the adherents of contrasting views to the effect that trait and situational 
explanations of behavior can coexist and that, in fact, behavior is determined 
through the interaction of trait and situational variables. The rapprochement was 
especially evident in a number of well-balanced and thoughtful discussions of the 
problem published from the late 1970s to the late 1980s.!? Several noteworthy 
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points emerged from these discussions. Behavior exhibits considerable temporal 
stability when measured reliably, that is, by summing repeated observations 
preferably made by multiple observers who are knowledgeable about the person 
being evaluated. When random samples of persons and situations are studied, in- 
dividual differences contribute more to total behavior variance than do situa- 
tional differences. Interaction between persons and situations contributes as 
much as do individual differences, or slightly more. To identify broad personality 
traits, we need to measure the individual across many situations, using dimensions 
that are publicly observable and behaviors that are relevant to the dimensions in 
question, and aggregate the results (Epstein, 1980; Kenrick & Funder, 1988). In 
spite of the relative consensus achieved on many of these matters, and of what has 
been learned already, there are still more questions and new questions about both 
sides of the person-situation debate (see, e.g., Funder, 1991). : 


The Person. The degree of behavioral specificity across situations itself varies 
from person to person. Individuals differ in the extent to which they alter their 
behavior to meet the demands of each situation. In this respect, moderate incon- 
sistency indicates effective and adaptive flexibility, while excessive consistency 
indicates maladaptive rigidity. Moreover, the particular situations across which 
behavior is consistent may vary among persons. This intersituational consistency 
is influenced by the way in which individuals perceive and categorize situations. 
And such grouping of situations, in turn, depends on the individual’s goals, motives, 
and feelings as well as on her or his prior experience with similar situations. 
Individual differences in consistency are of great interest to psychologists in 
general for obvious reasons. To the extent that such differences can be reliably as- 
sessed, they could be used as a moderator variable in the prediction of behavior. 
In addition, from the psychometric standpoint, intrapersonal and interpersonal 
differences in consistency are considered crucial in moderating the validity—and 
reliability—of all psychological measures. It is therefore not surprising that efforts 
are under way, from a variety of directions, to devise ways to assess these differ- 


ences, One method is based on ratings people assign to themselves on different 
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trait dimensions; scores are derived from each person's variance across ratings. 
tedness, which has, in 


Low interitem scale variances represent the quality of trai 
fact, been linked to higher validity coefficients (Amelang & Borkenau, 1986; 
Baumeister & Tice, 1988) 1A different approach is represented by Lannings 
(1991) notion of "scalability," defined as the degree to which a person maintains a 
normative ordering of behavioral items across situations as assessed by self-report. 

An intriguing proposal that links the concepts of person-situation interaction 


and social desirability, in terms of the process underlying responses to personality 
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questionnaire items, is Jackson's “threshold theory" of responding (see, e£» 


Helmes & Jackson, 1989; Jackson, 1986b). This model is based on the premise that 


inventory items represent à microcosm of behavior in the real world. It uses item 
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response theory methods to scale personality item content (see chap. 7). Jackson 
suggests that, just as the responses of many individuals to a single item can be used 
to derive an item characteristic curve, a single subject's responses to many items 
can be used similarly to plot a subject characteristic curve. Such a curve would 
predict a person's likelihood of endorsing items and would be based on the 
salience of the desirability dimension for the subject in question, her or his 
threshold or willingness to respond positively to items in terms of social desirabil- 
ity, and the desirability of the items themselves. 


The Situation. Situations also differ in the behavioral constraints they impose. 
Thus we could predict with a high level of confidence that readers will remain 
silent in a library and that motorists will stop at a red light. Similarly, persons— 
whatever their trait structure—are likely to swim at the beach and to read in the 
library. Nevertheless, certain individuals may spend their time reading while at 
the beach, and others may spend all too much time daydreaming about swimming 
while in the library. One way to better understand the constraints that situations 
impose on behavior is by studying the characteristics of various behavior settings. 
Roger Barkers recently updated work on ecological psychology provides a 
promising set of tools for classifying behavior settings and for describing various 
facets of the environment (Schoggen, 1989). 

Cross-cultural differences can be seen as a special, and more pervasive, in- 
stance of situational variability. As such, they afford a unique opportunity to 
study the consistencies and inconsistencies in the behavior of persons. To this 
end, the approaches to cross-cultural testing discussed in chapters 9 and 12 in the 
context of ability measures can also be used to investigate other psychological 
traits. 

When applied to cultures other than the one in which they were developed, 
personality tests show large differences. Any explanation of such cultural and sub- 
cultural differences requires specific knowledge of the conditions and circum- 
stances prevailing within each group. This has been increasingly recognized in 
the assessment of psychopathology in culturally diverse groups within the United 
States (Malgady, Rogler, & Costantino, 1987; Paniagua, 1994). Group differences 
on tests such as the MMPI, for example, can reflect nothing more than differences 
in interpretation of individual items or instructions. Cultural differences in the 
type of behavior considered socially desirable may likewise influence scores; ele- 
vated scores on a scale assessing depression in some groups could result from 
strong traditions of self-deprecation and modesty. In still other groups, high scores 
may indicate the prevalence of genuine emotional problems arising from child- 
rearing practices, conflicts of social roles, minority group frustrations, and other 
broad cultural differences. 

The issue of personality assessment across cultures is broader than the trans- 
portability of tests; rather, it involves the transportability of conceptual systems, 
such as traits and trait hierarchies, that are applied to behavior (Guthrie, Jackson, 
Astilla, & Elwood, 1983). Furthermore, just as was earlier discovered in the 
assessment of the ability domain, there is now a growing recognition that some 
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important personality dimensions are not universal. Accordingly, in addition to 
the ongoing efforts to adapt and translate traditional Anglo-American instru- 
ments for use with different populations, some personality measures designed to 
assess specific dimensions indigenous to cultural and subcultural groups are now 
being developed (see, e.g., Dana, 1993; Lonner & Berry, 1986). However, whereas 
some specific items and inventories may not be transportable across cultures, 
there is reason to believe that the hierarchical trait model may be useful in inte- 
grating cross-cultural findings. That is, provided one begins by measuring behav- 
iors identified as significant within each culture with instruments appropriate to 
each, it may be possible to identify some higher-level psychological constructs 
that may be universal or at least broadly generalizable across cultures (Anastasi, 
1992c; Diaz-Guerrero & Diaz-Loving, 1990). 


Traits and Situations. n That traits and situations are not incompatible ways 
of categorizing behavior is illustrated by self-report inventories for assessing test 
anxiety (1. G. Sarason, 1980). An example is the Test Anxiety Inventory?! (TAI) 
developed by Spielberger and his associates (Spielberger et al., 1980). This 
instrument is essentially a trait measure. The trait, however, is defined in terms of 
a specified class of situations, those centering on tests and examinations. Persons 
high in test anxiety tend to perceive evaluative situations as personally threatening. 
The inventory comprises 20 statements describing reactions before, during, or 
after tests or examinations. Respondents are asked to indicate how they generally 
feel by marking how frequently they experience each reaction (almost never, 
sometimes, often, almost always). Typical examples include “I freeze up on 
important exams” and “While taking examinations, ] have an uneasy, upset 
feeling.” The TAI yields a total score on anxiety proneness in test situations, as 
well as subscores on two major components identified through factor analysis— 
namely, worry and emotionality. In this context, worry is defined as "cognitive 
concerns about the consequences of failure" and emotionality as "reactions of the 
autonomic nervous system that are evoked by evaluative stress" (Spielberger et 
al., 1980, p. 1). i jw 
Still further degrees of situational specification are provided in the Test Anxi- 
ety Profile (Oetting & Deffenbacher, 1980).22 On this instrument, respondents 
rate their reactions on items covering feelings of anxiety and thought interfer- 
ence. Both types of anxiety scores are obtained for each of six testing situations in 
which respondents are instructed to imagine themselves, ranging from "multiple- 
choice test" and "unannounced ‘pop’ quiz” to “giving a talk. i 
Perhaps because of the prevalence of test anxiety, and the relative ease of inves- 


tigating it within the academic setting, research and theorizing on the topic con- 


tinue with undiminished vigor both in the United States and in other countries 
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(Hagtvet & Johnsen, 1992). At any rate, this construct provides a good model of 
how trait and situational concepts can be useful in categorizing behavior, espe- 
cially in the personality domain. Depending upon the purpose of the test, trait 
constructs may be defined with different degrees of breadth or narrowness and may 
be linked to specified types of situations. 


Traits and States. Another way to conceptualize the behavior domain 
assessed by personality tests involves a differentiation between traits and states. 
This differentiation is most clearly exemplified in the State-Trait Anxiety 
Inventory (STAI) developed by Spielberger and his co-workers (Spielberger, 
1985; Spielberger et al., 1983). In the construction of this instrument, state 
anxiety (S-Anxiety) was defined as a transitory emotional condition character- 
ized by subjective feelings of tension and apprehension. Such states vary in 
intensity and fluctuate over time. S-Anxiety is measured by 20 short descriptive 
statements which the individual answers in reference to how he or she feels at the 
moment (e.g., "I feel calm"; "I am jittery"). The answers are recorded by indicating 
the intensity of the feeling (not at all, somewhat, moderately so, very much so). 
Trait anxiety (T-Anxiety) refers to relatively stable anxiety-proneness, that is, 
the individual's tendency to respond to situations perceived as threatening with 
elevated S-Anxiety intensity. Respondents are instructed to indicate how they 
generally feel by marking the frequency with which each of the 20 statements ap- 
plies to them (almost never, sometimes, often, almost always). Examples of the 
statements are "I am inclined to take things hard" and "I am a steady person." In- 
dividuals high in T-Anxiety tend to exhibit S-Anxiety elevations more often 
than do individuals low in T-Anxiety, because they react to a wider range of situ- 
ations as threatening or dangerous. They are especially responsive to interper- 
sonal situations posing some threat to their self-esteem, such as performance 
evaluation or the experience of failure. Whether or not S-Anxiety increases in a 
given situation, however, depends upon the extent to which the individual per- 
ceives the situation as threatening or dangerous on the basis of her or his past ex- 
perience. The STAI and its companion version for children, the State-Trait 


` Anxiety Inventory for Children ( STAIC), have been translated into 43 languages 


and dialects and have accumulated a bibliography of more than 6,000 studies 
(Spielberger, 1989; Spielberger & Sydeman, 1994). 

The state-trait differentiation was applied by Spielberger and his associates in 
another subsequently developed inventory, the State-Trait Anger Expression In- 
ventory (STAXI—Spielberger, 1988; Spielberger, Johnson, Russell, Crane, Ja- 
cobs, & Worden, 1985). The STAXI comprises 44 items representing the 
domains of anger experience and anger expression. The domain of anger experi- 
ence is assessed by two scales that parallel those of the STAI, namely State Anger 
(S-Anger) and Trait Anger (T-Anger); T-Anger, in turn, has two subscales of four 
items each: Angry Temperament and Angry Reaction. The frequency of anger ex- 
pression is tapped by three eight-item scales: Anger-in, Anger-out, and Anger- 
Control. Reviews of this test by Biskin (1992) and Retzlaff (1992) can be found 
in the 11th MMY. 
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CURRENT STATUS OF PERSONALITY INVENTORIES 


The construction and use of personality inventories are beset with special difficul- 
ties over and above the common problems encountered in all psychological test- 
ing. The question of impression management is far more acute in personality 
measurement than in aptitude testing. The behavior measured by personality 
tests is also more changeable over time than that measured by tests of ability. The 
latter fact complicates the determination of test reliability, since random tempo- 
ral fluctuations in test performance are likely to become confused with broad, sys- 
tematic behavioral changes. Even over relatively short intervals, it cannot be 
assumed that variations in test responses are restricted to the test itself and do not 
characterize the area of nontest behavior under consideration. A related problem 
is the greater situational specificity of responses in the noncognitive than in the 
cognitive domain. 

The 1990s have witnessed a resurgence of research that faced up to the com- 
plexities of personality assessment and sought innovative solutions to these long- 
standing problems. The period is characterized by significant theoretical and 
methodological advances. 3 The earlier critiques of personality measurement un- 
doubtedly had a salutary effect and in part stimulated the subsequent develop- 
ments in this area of psychometrics. We must, however, guard against the danger 
that, in the zeal to eradicate fallacious thinking, sound and useful concepts may 
also be lost. The occasional proposal that diagnostic personality testing and trait 
concepts be completely discarded, for example, indicates an unnecessarily narrow 
definition of both terms. Diagnosis need not imply the labeling of persons, the use 
of traditional psychiatric categories, or the application of the medical, "disease" 
model. Diagnostic testing should be used as an aid in describing and understand- 
ing the individual, identifying her or his problems, and reaching appropriate ac- 
tion decisions. Similarly, traits refer to the categories into which behavior must 
necessarily be classified if we are to deal with it at all—in science or in any other 
context. The optimum category breadth will vary with the specific purpose of as- 
sessment. The hierarchical trait models presented in chapter 11 can be employed 
equally well in the description of noncognitive behavior. Under certain circum- 
stances, relatively broad traits will serve best; under other circumstances, narrow, 
specifically defined behaviors will need to be assessed. 
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Measuring Interests and 
Attitudes 


he nature and strength of one’s interests and attitudes represent an 
important aspect of personality. These characteristics materially affect 
educational and occupational achievement, interpersonal relations, the 
enjoyment one derives from leisure activities, and other major phases of daily 
living. Although tests typically are directed toward the measurement of one or 
the other of these variables, the available instruments cannot rigidly be classified 
according to such discrete Categories as interests and attitudes. Overlapping is the 
tule. Thus, a questionnaire designed to assess the relative strength of interests in 
investigative, artistic, or conventional occupations also might be said to gauge 
the individual’s attitudes toward pure science, art for art’s sake, practical tasks, 
and the like. 

Values are also clearly related to life choices and are often discussed in con- 
junction with interests, attitudes, and preferences, A great deal of research has 
been, and continues to be, done by personality and social psychologists on the 
topic of values, including some interesting and wide-ranging investigations of the 
generality of values across cultures (S. H. Schwartz, 1992, 1994; S. H. Schwartz & 
Sagiv, 1995; Super & Sverko, 1995). In recent years, however, relatively few de- 
velopments have occurred in the area of standardized, commercially available in- 
struments aimed exclusively at the assessment of values. This is related to several 
problems specific to the measurement of values, such as the difficulties of sam- 
pling systematically, and at the appropriate level of abstraction, from value do- 
mains. In addition, some of the early and most widely used value measures proved 
to be incompatible with the way values eventually came to be conceptualized in 
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the field (Braithwaite & Scott, 1991). While some "stand-alone" tests, such as 
the Life Values Inventory (Brown & Crace, 1996) and The Values Scale (Nevill 
& Super, 1989) are still being published or revised, much of the formal assessment 
of values is now incorporated within instruments designed to facilitate career de- 
cision making and to assess work-related attitudes and motives.” 

The study of interests received its strongest impetus from educational and ca- 
reer counseling, which will be discussed in chapter 17. To aslightly lesser extent, 
the development of tests in this area was also stimulated by occupational selec- 
tion and classification. From the viewpoint of both the worker and the employer, 
a consideration of the individual's interests is of practical significance. The assess- 
ment of opinions and attitudes originated largely as a problem in social psychology. 
Attitudes toward different groups, for example, have obvious implications for in- 
tergroup relations. Similarly, the gauging and prediction of public opinion regard- 
ing a wide variety of issues, institutions, or practices are of deep concern to the 
social psychologist, as well as to the worker in business, politics, and other applied 
fields. The measurement of opinions and attitudes has also made rapid strides in 
the areas of consumer research and employee relations. 

All instruments surveyed in this chapter are self-report inventories developed 
by one or more of the methods described in chapter 13. It should be noted, how- 
ever, that in this area, as in the measurement of all personality constructs, other 
procedures are being continually explored. Examples of these other procedures 
are discussed in chapters 15 and 16. 


INTEREST INVENTORIES: CURRENT SETTING 


The large majority of interest inventories? are designed to assess the individual's 
interests in different fields of work. Some also provide an analysis of interests in 
educational curricula or fields of study, which in turn are usually related to career 
decisions. Although the frequency of test use in counseling has remained fairly 
stable since the 1950s, the use of interest tests has increased relative to that of 
ability tests (Zytowski & Warman, 1982). The more recently developed or revised 


inventories reflect certain major changes in career counseling. One of these 
changes pertains to the increasing emphasis on self-exploration. More and more in- 
struments provide opportunities for the individual to study the detailed test re- 
sults and relate them to occupational information and other data about personal 
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qualifications and experience. Training in career decision making has itself re- 
ceived increasing attention. The occupational interest inventories to be discussed 
in this chapter should be considered against the background provided on career 
assessment in chapter 17. 

A second and related change concerns the goal of interest measurement. 
Today, there is more and more emphasis on expanding the career options open to the 
individual. In fact, the term “exploration validity” is used to designate the effect 
that interest inventories can have in terms of increasing behaviors instrumental 
to career exploration (see, e.g., Randahl, Hansen, & Haverkamp, 1993). Thus, 
the interest inventory, as well as the more comprehensive career orientation pro- 
grams cited in chapter 17, are being used to acquaint the individual with suitable 
occupations that he or she might not otherwise have considered. 

The third significant change is itself associated with this expansion of career 
options. It relates to a concern about the sex fairness of interest inventories. In 
general, interest inventories compare an individual's expressed interests with 
those typical of persons engaged in different occupations. This is done either in 
the scoring of individual item responses, or in the interpretation of scores in 
broad interest areas, or both. While this approach certainly represents an objec: 
tive, empirical procedure for evaluating one's interests, it tends to perpetuate ex- 
isting group differences among occupations. If there are large discrepancies in the 
proportion of men and women in some occupations, such as engineering or nurs- 
ing, these differences tend, in one way or another, to influence the interpretation 
of results obtained by males and females on interest inventories. For this reason, 
considerable discussion and intensive research efforts have been devoted to ways 
of reducing possible sex bias in interest inventories (Tittle & Zytowski, 1978; Zy- 
towski & Borgen, 1983). A set of guidelines for the assessment of sex bias and sex 
fairness in career inventories has also been prepared and widely distributed.* 
Nearly every newly developed or revised inventory shows the influence of these 
guidelines and makes some provision for dealing with the issue of sex fairness. 
One of the steps taken has been the elimination of sex bias in the wording of in- 
ventory items. In addition, other common solutions have involved balancing the 
content of items evenly in terms of typical differences in sex role socialization and 
providing the most appropriate norms possible for each sex group on inventory 
scales. These measures have improved the quality of technical data available to 
career counselors and other users of interest inventories. However, because of lin- 
gering sex differences in interest patterns and in occupational group membership, 
there are still social policy issues concerning the interpretation and use of interest 
inventory data that need to be carefully considered in the career assessment of 
women (Hackett & Lonborg, 1994). 

A related development has come to the fore even more recently, largely in 
response to legislation aimed at providing vocational guidance to disabled and 
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disadvantaged populations? (L. S. Gottfredson, 1986b; Reed, Rotatori, & Day, 
1990; Szymula, 1990). Among other things, the special requirements of persons 
with disabilities in terms of vocational assessment have prompted the develop- 
ment of several picture interest inventories that use drawings or audiovisual stim- 
uli such as slide-tape and videocassette formats (Elksnin & Elksnin, 1993; Kapes 
et al., 1994, pp. 307-345). The first generation of these instruments, designed to 
bypass the reading demands of traditional paper-and-pencil interest inventories, 
is inadequate from the psychometric standpoint, especially in the realm of valid- 
ity. However, picture interest inventories represent an innovative way of adminis- 
tering test stimuli that is likely to develop further as technology improves and the 
need to provide alternative measures for special populations expands. Modifica- 
tions of traditional instruments, such as the ones described in chapter 9 in con- 
nection with ability tests for special populations, can, of course, be used in the 
assessment of interests as well. The impetus provided by the Americans with Dis- 
abilities Act of 1990 (BL. 101-336) is certain to stimulate further interest and 
work in the area (Bruyère & O'Keeffe, 1994). 


THE STRONG INTEREST INVENTORY™ (SII) 


Origins and Development of the SIL. This interest inventory, whose latest 
edition was published in 1994, has a long history. The general approach followed 
in its construction was first formulated by E. K. Strong, Jr, while attending a 
1919-1920 graduate seminar on interest measurement at the Carnegie Institute 
of Technology (D. P. Campbell, 1971, chap. 11; Fryer, 1931, chap. 3). When first 
published in 1927, the Strong Vocational Interest Blank® (SVIB) introduced two 
principal procedures in the measurement of occupational interests. First, the 
items dealt with the respondent's like or dislike for a wide variety of specific 
activities, objects, or types of persons that he or she commonly encountered in 
daily living. Second, the responses were empirically keyed for different occu- 
pations. These interest inventories were thus among the first tests to employ 
criterion keying of items, subsequently followed in the development of such 
personality inventories as the MMPI and CPI (chap. 13). It was found that 
persons engaged in different occupations were characterized by common interests 
that differentiated them from persons in other occupations. These differences in 
interests extended not only to matters pertaining directly to job activities, but 
also to school subjects, hobbies, sports, types of plays or books the individual 
enjoyed, social relations, and many other facets of everyday life. It thus proved 
feasible to prepare an inventory that explored an individual's interests in familiar 


?Two specific instances of such legislation are the Carl D. Perkins Vocational Education Act of 1984 (P.L. 
98-524) and dii Carl D. Perkins Vocational and Applied Technology Education Act Amendments of 1990 
(PL. 101-392). 
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things and thereby to determine how closely her or his interests resembled those 
of persons successfully engaged in particular occupations. 

Beginning in the 1970s, extensive innovations have been introduced and im- 
plemented in successive revisions of the SII (D. P. Campbell, 1974; D. P. Campbell 
& Hansen, 1981; Hansen & D. P. Campbell, 1985; Harmon, Hansen, Borgen, & 
Hammer, 1994). The principal changes include (1) the introduction of a theoreti- 
cal framework to guide the organization and interpretation of scores, (2) the merg- 
ing of the earlier men's and women's forms and the renorming of all occupational 
scales on new male and female samples, and (3) a substantial increase in the num- 
ber of scales for vocational/technical occupations requiring less than a college de- 
gree for entry—a category that had been underrepresented in earlier editions. 


The SII-Form T317: General Description. The current Strong Interest 
Inventory consists of 317 items grouped into eight parts. In the first five parts, the 
respondent records her or his preferences by marking L, I, or D to indicate “Like,” 
“Indifferent,” or “Dislike.” Items in these five parts fall into the following cate- 
gories: occupations, school subjects, activities (e.g., making a speech, repairing a 
clock, raising money for charity), leisure activities, and day-to-day contact with 
various types of people (e.g., very old people, military officers, people who live 
dangerously). Two additional parts require the respondent to express preferences 
between paired activities (e.g., dealing with things vs. dealing with people) and 
between all the possible pairings of four items from the world of work (i.e., ideas, 
data, things, and people). Finally, a portion of the inventory asks the respondent 
to mark a set of self-descriptive statements “Yes,” “No,” or “?.” 

The Strong can be scored only by computer, at scoring centers designated by 
the publisher or with the use of software available from the publisher in a variety 
of options. Figure 14-1 shows the first page of the latest Strong profile, which pro- 
vides a “snapshot” or summary of the test taker's highest scores on the main scales 
of the inventory. There are three levels of scores, differing in breadth. The 
broadest and most comprehensive are the six General Occupational Theme 
scores; the next subdivision includes 25 Basic Interest Scales; and the most spe- 
cific level provides the 211 available Occupational Scales. In addition to these, 
Form T317 of the Strong produces scores on four new Personal Style Scales that 
assess preferences in Work Style, Learning Environment, Leadership Style, and 
Risk Taking/Adventure. The profile form also provides a set of Administrative In- 
dexes, including total number of. responses (to detect excessive omissions), num- 
ber of infrequent or unusual responses, and the response percentage for each of 
the three response options on each of the eight parts of the Strong inventory. 


These indexes can be used as checks on careless responding or special response 
sets. 


i due SIL profile has been redesigned and now consists of six pages: The front of each page presents the 
indivi lual bs taker's scores on all the scales, and the back of each page has general information about the 
meaning of scores and suggestions on how to proceed with career exploration. In addition to the profile, more 


extensive and qualitative narrative reports with interpretations tailored to the scores of the individual respon- 
dent can also be obtained. 
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The SII's classification of occupational interests is derived from the theoretical 
model developed by John Holland (1966, 1985/1992) and supported by extensive 
research both by Holland and by other, independent investigators. The General 
Occupational Themes identified by Holland model are designated Realistic (R), 
Investigative (1), Artistic (A), Social (S), Enterprising (E), and Conventional 
(C)." Each theme characterizes not only a type of person but also the type of 
working environment that such a person finds most congenial. Each of these en- 
vironments tends to be populated and dominated by persons of the corresponding 
type. According to Holland, persons are not classified rigidly into one of the six 
major types; rather, they are characterized by degree of resemblance to one or 
more types. Such combinations of types, ordered by degree of resemblance, thus 
provide a multiplicity of patterns or "codes" for describing the wide diversity of 
individual differences. 

Figure 14-2 presents the six themes at the corners of the hexagonal model de- 
veloped by Holland. The first letters of each of the themes are customarily used to 
designate them for the sake of brevity. Thus, the order in which the themes fall 
around the hexagon forms the acronym R-I-A-S-E-C, which in turn is used to 
designate the model. It will be noted that the highest correlations were obtained 
between theme scales occupying adjacent positions along the perimeter of the 
hexagon. The lowest correlations were found between scales at opposite ends of 
diagonals. For example, the Realistic Scale correlates .53 with the Investigative 
Scale, but only .04 with the Artistic and .06 with the Social Scale. Similarly, if 
the Occupational Scales are plotted on the hexagon, most of them follow the ex- 
pected order along the perimeter. For example, Engineer, coded RI, falls between 
Realistic and Investigative; Banker, coded CE, falls between Conventional and 
Enterprising. Usually, occupations that score high on one theme score low on its 
direct opposite (e.g., Artistic and Conventional). When an Occupational Scale 
shows substantial correlations with themes at Opposite points of the hexagon, it 
often includes heterogeneous subgroups with recognizably different occupational 
functions. 

The 25 Basic Interest Scales are classified under the six General Occupational 
Themes. These scales consist of clusters of substantially intercorrelated items. 
The Basic Interest Scales are more homogeneous in content than the Occu- 
pational Scales and can therefore help in understanding why an individual scores 
high on a particular Occupational Scale. 

The Occupational Scales, which constituted the main body of the original 
SVIB, are now grouped under the appropriate General Occupational Themes. In 
the continuing research program on this inventory, new scales have been added 
over the years and old scales have been updated with fresh criterion samples. 
Form T317 includes 211 Occupational Scales, of which 83 percent were normed 


These themes overlap considerably with the evaluative attitudes assessed by the Study of Values (Allport et al., 
1960) which, in turn, were suggested by Spranger's Types of Men ( 1928). 
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Realistic 53 Investigative 


Conventional 


Figure 14-2. Holland’s Hexagonal Model of General Occupational Themes. For 
each pair of themes, correlation is based on the General Reference Sample of 9,484 men 
and 9,467 women. 


(Reprinted with permission of the publisher from p. 51, Strong Interest Inventory: Applications and technical guide, 
by Lenore W. Harmon, Jo-Ida C. Hansen, Fred H. Borgen, and Allen L. Hammer. Copyright © 1994 by Stan- 


ford University Press.) 


on samples obtained in the 1980s and 1990s. For all but seven of these scales, it 
proved feasible to locate enough respondents to develop both female-normed and 
male-normed scales (i.e. 102 scales for each sex). Thus far, five scales are normed 
only on women; two, only on men. ; 
Most of the samples employed in developing each Occupational Scale contain 
200 or more persons, although the actual number varies from 60 to 1,187. For the 
1994 revision, more than 55,000 persons were tested, and slightly fewer than 
40,000 of them met the required specifications for use in scale development. The 
occupational criterion groups consisted of persons mostly between the ages of 25 
and 60, employed in the given occupation for at least three years, who reported 
satisfaction with their work and were performing the duties typical of members of 
the occupation. There are also two General Reference Samples (GRSs), includ- 
ing 9,484 men and 9,467 women, all of whom were tested in the 1990s. The 
GRSs comprise 98 different occupations, of which 90 are represented by 200 
members, randomly selected from the pool available for each occupation; eight 
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occupations are represented by samples of between 92 and 195 members only? 
The educational attainment levels of the individuals in the Strong GRSs are far 
higher than those of the general population of the United States; in fact, approx- 
imately 80 percent of them hold a Bachelor's, Master's, or higher degree (Harmon 
et al., 1994, p. 110). This proportion of college graduates exceeds comparable fig- 
ures in the general population even for people who have the types of professional 
and white-collar jobs that constitute the vast majority of the occupations in the 
Strong GRSs. Thus, the latest edition of the Strong, like its predecessors, may be 
criticized for the lack of representativeness of its samples (Worthen, 1995). 

The items of the SII were placed into scales through two different methods. 
Those in the General Occupational Theme and Basic Interest Scales were 
grouped into homogeneous clusters on the basis of similarities in their content 
and in the way people responded to them; as gauged by factor analysis. On the 
other hand, items for each Occupational Scale were selected and weighted on the 
basis of differences in item-response percentages between the occupational crite- 
tion sample and the GRS for each sex. In the scale for female accountants, for ex- 
ample, a +1 weight indicates that the response occurs more often, and a —1 that it 
occurs less often, among female accountants than among women in general. Re- 
sponses that failed to differentiate substantially between female accountants and 
women in the GRS do not appear in the female accountant scale, regardless of 
how often they were chosen by accountants. The total raw score on each Occupa- 
tional Scale is the algebraic sum of the respondent's plus and minus weights. 


Scoring and Interpretation. All scores on the Strong inventory are reported 
as standard scores with a mean of 50 and an SD of 10. For both General 
Occupational Themes and Basic Interest Scales, the normative sample—from 
which standard scores are computed—is the combined-sex GRS (N = 18,951). 
Both the interpretive labels and graphic representations of scores on the profiles, 
however, are based on the same-sex norms. Comparisons with other-sex norms 
can also be made by reference to the graphs presented on the profile forms. On 
the Occupational Scales, each respondent actually receives two standard scores— 
one derived from the female occupational sample, the other from the male 
occupational sample. These reporting procedures, while directed primarily to 
same-sex comparisons, provide the necessary data for counselors and respondents 
to make cross-sex comparisons for fuller and more effective interpretation of 
response patterns in individual cases. 

: The SII Applications and Technical Guide (Harmon et al., 1994) is of outstand- 
ing quality and contains much information to aid the counselor in interpreting 
results and discussing their possible implications with the client. For instance, 
profiles yielding internally consistent results tend to have higher predictability. 
On the other hand, certain inconsistencies, as between Basic Interest Scales and 
Occupational Scales, provide useful insight into the nature and sources of the 


For a complete listing of all the occupational samples used in the 1994 revision of the SII, see Harmon et al 
(1994, Appendix A). : 
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respondent's expressed preferences. From another angle, the results obtained with 
the inventory can be generalized to other, related occupations through linkages 
that have been established with the occupations listed in the Dictionary of Occu- 
pational Titles (DOT) developed by the U.S. Department of Labor (1991) and 
other similar reference materials (Gottfredson & Holland, 1989; Harmon et al., 
1994, Appendixes A & B). The Applications and Technical Guide also has several 
chapters devoted to the use of the Strong with special populations, such as cultur- 
ally diverse groups and people with disabilities. In addition, several other publica- 
tions designed to help both clients and counselors understand and use the results 
of the Strong are available (see, eg, Borgen & Grutter, 1995; Hirsh, 1995; 
Prince, 1995). 


Psychometric Evaluation. The Strong has undergone a continuing research 
program that has yielded extensive data about its reliability and validity (D. P. 
Campbell, 1971, 1977; Hansen & Campbell, 1985; Harmon et al., 1994). For the 
Occupational Scales, the median retest reliability for a sample of 191 employed 
adults, after intervals of between three and six months, was .90; for the Basic 
Interest Scales, the corresponding reliability was .86; and for the General 
Occupational Themes, it was 89. Long-term stability of the Occupational Scales 
for previous editions of the Strong is also high. The correlations for periods 
extending up to 20 years fall mostly in the .60s and .70s for individuals under the 
age of 25 and in the .80s for those over 25. 

Concurrent validation data are provided by the degree of differentiation 
among various occupational samples, and between occupational samples and the 
reference samples. For the 211 occupational samples included in the 1994 revi- 
sion, the median overlap was 36 percent; this reflects an average separation of 
slightly under two standard deviations between the occupational scales and the 
GRSs. Predictive validity has been checked in several samples over long intervals 
for previous versions of the Strong. The evidence indicates substantial correspon- 
dence between initial occupational profile and the occupation eventually pur- 
sued. A specific example is provided by a 40-year follow-up of a sample of 


psychologists whose professional careers revealed some suggestive relations be- 


tween the flatness versus distinctness of their original profiles and such occur- 
rences as frequency of job changes and shifting from teaching or research to 
administrative or applied work (Vinitsky, 1973). Another investigation found 
strikingly high cross-cultural profile similarities among samples of psychologists 
tested in nine Western nations (Lonner & Adams, 1972). In spite of the exten- 
sive research base that the Strong inventories have accumulated over the years 
and of their apparent solidity as indicators of occupational choice and tenure, 


new studies using the latest revision need to be undertaken in order to investigate 


its predictive validity directly. d 
With regard to construct validation, the relations of the Occupational Scales 


to the General Occupational Themes, as well as the interrelations of the themes 
portrayed in Figure 14-2 among themselves, are of special relevance. The R-L-A- 
S-E-C model developed by Holland (1966, 1985/1992) has been of enormous 
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heuristic value in the study of vocational interests and the Strong inventories 
have in turn been an integral part of that research tradition, along with the other 
interest measures that utilize the same scheme. Research findings using the 
Strong, as well as other inventories, generally have been quite consistent with the 
predictions of the model. 

The structure and organization of the SII have allowed it to be modified and 
expanded in a number of ways through its continuing revisions. The Strong in- 
ventory is not only a pioneer in the field of interest measurement; it is also the 
most widely used among counseling psychologists in the United States (Watkins; 
Campbell, & Nieberding, 1994). A few other instruments have been in use nearly 
as long and almost as widely. The period since the 1960s, however, has witnessed 
a rapid growth in the development of new instruments in this area. In part, this 
development reflects the growing attention to career exploration and the recogni- 
tion that interests play a key role in such exploration. 

The comprehensive programs for career exploration discussed in chapter 17 
typically include measures of occupational interests, which are used in combina- 
tion with multiple-aptitude test scores and job information. For example, interest 
data elicited with the Career Interest Inventory (CII) can be used in conjunction 
with the Differential Aptitude Tests, with which it was standardized (Psychologi- 
cal Corporation, 1991a). Similarly, the Occupational Aptitude Survey and Inter- 
est Schedule-2 (OASIS-2— Parker, 1991a, 1991b) was designed to assist high 
school students in vocational exploration and career development by providing 
scores on 6 aptitude and 12 interest factors. Another variation in the area of ca- 
reer planning is exemplified by the Harrington-O'Shea Career Decision-Making 
System-Revised (CDM-R—Harrington & O'Shea, 1993), which attempts to in- 
tegrate self-reported data on interests, values, and abilities with job information. 
The Campbell Interest and Skill Survey (CISS—Campbell, Hyne, & Nilsen, 
1992) also measures self-reported interests and skills and is organized in a way 
reminiscent of the Strong inventory, with which David P. Campbell—author of 
the CISS—was involved for quite some time. The addition of data on skills al- 
lows comparisons of high and low patterns of scores on the interests and skills 
scales; this, in turn, expands the basis for career exploration and decision making 
provided by the survey. Reviews of the Cll, OASIS-2, CDM-R, and CISS can be 
found in Kapes et al. (1994). 


INTEREST INVENTORIES: OVERVIEW 
AND SOME HIGHLIGHTS 


Among the many currently available interest inventories, four have been chosen 
for individual mention because each illustrates some noteworthy feature in its 
theoretical orientation, its methodology, or the type of population for which it 
was designed. No attempt will be made to describe or evaluate these instruments 
in detail; all have been reviewed recently in the Mental Measurements Yearbooks, 
either in their current or former versions. 
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Jackson Vocational Interest Survey (JVIS). The JVIS (Jackson, 1977) is 
selected for special attention—first, because it exemplifies sophisticated test 
construction procedures, and second, because in several respects its approach 
contrasts sharply with that followed in the SII. The original publication dates of 
the Strong and the Jackson inventories are 50 years apart. The Strong focuses on 
specific occupations, both in item selection and normative interpretations; the 
Jackson uses broad interest areas in both item development and scoring system. 
The Strong occupational scales represent an extreme example of empirical 
criterion keying and criterion-related validation; the Jackson exemplifies 
construct validation at every stage in its development. In the Strong, the large 
majority of items are independently marked Like, Indifferent, or Dislike by the 
respondent; in the Jackson, all items are of the forced-choice type. 

As in the development of the Personality Research Form and the Jackson Per- 
sonality Inventory described in chapter 13, the first step in the development of 
the JVIS was to define the constructs or dimensions to be measured. These di- 
mensions, chosen on the bases of published research on the psychology of work 
and of factor-analytic and rational classifications of vocational interest items, 
were of two types. One was defined in terms of work roles, the other in terms of 
work styles. Work roles pertain to what a person does on the job. Some of these 
roles are closely associated with a particular occupation or type of occupation, 
such as engineering, law, or elementary education. Others, such as human- 
relations management and professional advising, cut across many occupations. 
Work styles refer not to job-related activities, but to preferences for working envi- 
ronments or situations in which a certain kind of behavior is expected. Work 
style dimensions are typically related to a person's values either directly or indi- 
rectly. Examples of work styles include planfulness, independence, and dominant 
leadership. 

Development of the inventory followed several steps, including successive try- 
outs and statistical analyses of items that had been prepared to fit detailed specifi- 
cations for each of the work roles and work styles. Starting with an initial pool of 
over 3,000 items, originally presented singly for a like-dislike response, the proce- 
dure included factor analyses of subsets of the items prepared for each scale. The 
presence of response bias, reflected in a large general factor, was removed statisti- 
cally before proceeding with further item analyses. Items were selected if they 
showed high correlations with total factor scores on their own scale and low cor- 
relations with other scales. Pairs of items that represented different work roles or 
work styles and had shown, similar endorsement frequencies when they were pre- 
sented singly were assembled for the forced-choice format . i 

The final form of the JVIS contains 34 basic interest scales, covering 26 work 
roles and 8 work styles. The inventory was designed to be equally applicable to 
both sexes, although separate percentile norms for female and male subgroups are 
also available. Norms were derived from large samples of college and high school 


?However, in contrast to other forced-choice instruments, the pairing of the JVIS items was accomplished in 
such a way that the resulting scores are not ipsative in nature. 
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.students in the United States and Canada. A high score on any of the 34 JVIS 
basic interest scales indicates an interest in the things people do in a particular 
field of work, as well as in the way people in that work context are expected to 
act. 

The JVIS can be hand-scored quickly and conveniently for the 34 scales. How- 
ever, the computer-based scoring options available utilize more recent norms and 
provide several additional score analyses either in a short report or in a newly re- 
vised extended narrative report. The latter gives individualized descriptive and 
interpretive material as well as a good deal of information to aid in career explo- 
ration. For example, the computer-generated reports include scores derived from 
a factor analysis of the 34 basic interest scales. These scores, modeled after Hol- 
land's six themes, include the following 10 General Occupational Themes: 
Expressive, Logical, Inquiring, Practical, Assertive, Socialized, Helping, Conven- 
tional, Enterprising, and Communicative. A measure of Academic Satisfaction 
and indices of Response Consistency and Infrequency, as well as a tally of Unscor- 
able Responses, is also provided. Other available score analyses involve compari- 
sons of the respondent's profile as a whole with modal profiles obtained by college 
students in 17 clusters of academic major fields and by persons engaged in 32 oc- 
cupational clusters. The occupational profiles were derived from joint administra- 
tions of the JVIS and Strong inventories which, through statistical analyses, 
established a linkage between the two inventories. This linkage provided a means 
of using the extensive occupational database of the Strong in interpreting JVIS 
scores. 

Some reviewers have suggested that the wording of the JVIS items may be too 
sophisticated for many high school students (D. T. Brown, 1989; J. W. Shepard, 
1989). The survey item booklet has recently undergone a few minor changes in 
wording, but these are related to updating the terminology used rather than to 
simplifying it. A revision of the manual, incorporating norms gathered in the 
mid-1990s, is also forthcoming shortly. In the meantime, a handbook on applica- 
tions of the JVIS to career counseling (Verhoeve, 1993) and an occupations guide 
(Jackson, 1995) are available from the publisher. 


The Kuder Occupational Interest Survey and Its Predecessors. The 
interest inventories developed by Frederic Kuder have been in use almost as long 
as the Strong series. The earliest was the Kuder Preference Record- Vocational, 
whose approach to the measurement of interests differed from the Strong's in two 
major ways. First, Kuder used forced-choice triad items, in which the respondents 
indicated which of the three activities they would like most and which least. 
Second, scores were obtained not for specific occupations, but for 10 broad 
interest areas—namely, Outdoor, Mechanical, Computational, Scientific, Persua- 
sive, Artistic, Literary, Musical, Social Service, and Clerical. Items for each scale 


10For a full description of the procedures that were followed, see Jackson (1977, chap. 4) and Jackson and 
Williams (1975). R : 
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were formulated and tentatively grouped on the basis of content validity; final 
item selection was based on internal consistency and low correlations with other 
scales. The Kuder General Interest Survey (KGIS) was developed later as a 
revision and downward extension of the Kuder Preference Record- Vocational. 
Designed for Grades 6 to 12, this form uses simpler language and easier 
vocabulary, requiring only a sixth-grade reading level. Reviews of the KGIS 
(M. Pope, 1995; D. Thompson, 1995) can be found in the 12th MMY. 

A still later version, the Kuder Occupational Interest Survey (KOIS), provides 
scores with reference to specific occupational groups, as does the Strong inven- 
tory (Kuder, 1966; Kuder & Diamond, 1979; Kuder & Zytowski, 1991). Unlike 
the Strong, however, the KOIS does not employ a general reference group. In- 
stead, the respondent's score on each occupational scale is expressed as a correla- 
tion between her or his interest pattern and the interest pattern of the particular 
occupational group.!! This interest survey can be scored on site or through the 
publisher by means of optical scanning, and it can also be administered and 
scored on a computer. Scores are currently available for 109 specific occupational 
groups and 40 college majors. Some scales have been developed only on men, 
some only on women, and some on both. Scores on all scales, however, are re- 
ported for both female and male respondents. The occupations covered by this in- 
ventory vary widely in level, ranging from beautician and truck driver to chemist 
and lawyer. 

Through intensive statistical analysis of the scores of 3,000 persons (100 in 
each of 30 core groups representative of the occupations and college-major fields 
covered by the inventory), Kuder demonstrated that better differentiation be- 
tween occupational groups can be achieved with the scoring system employed in 
this survey than with the occupational scales derived through the use of a general 
reference group. Research on other occupational scales is continuing (Zytowski, 
1992). 

The KOIS now provides both occupational scores and 10 broad, homogeneous 
basic interest scores, labeled Vocational Interest Estimates (VIE). The VIE are 
percentile scores derived from short scales equivalent to the 10 interest area 
scores of the early Kuder Preference Record. They can also be converted to the 
Holland R-I-A-S-E-C theme codes by direct correspondence for some scales and 
by averaging percentiles on two or three Kuder scales for others (e.g., average of 
Kuder Artistic, Literary, and Musical for Holland Artistic). Reviewers have gen- 


erally praised the technical characteristics of the KOIS while pointing to the 


` dearth of predictive validity data on this instrument. The failure to address the ef- 


fects of its forced-choice format on scores has also been criticized (see Herr, 1989; 


Tenopyr, 1989). 


_ cma init trn 
E correlation employed is the lambda coeffici 

iserial correlation adjusted for differences in homogeneity P r, 
able consists of the marked versus unmarked responses on the individual’s answer sheet; the continuous variable 
is the proportion of persons in the criterion grouP marking each response. 
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‘Career’ Assessment Inyentory—The Vocational Version (CAI-VV).'? 
First released in 1975, the CAI (Johansson, 1984) is patterned closely on the 
Strong inventory. In contrast to most interest inventories, however, it was 
designed specifically for persons seeking a career that does not require a four-year 
college degree or advanced professional training. It concentrates on skilled trades, 
technical work, and service occupations. Examples of the currently available 
occupational. scales are aircraft mechanic, dental hygienist, cafeteria worker, 
computer programmer, and registered nurse. The 305 inventory items are grouped 
under three content categories: Activities, School Subjects, and Occupations. 
Each item provides five response options, from “like very much” to “dislike very 
much.” Written at a sixth-grade reading level, the CAI can also be used with 
adults who have poor reading skills. Like the Strong, the CAI provides scores on 
three major types of scales, including the 6 Holland General Theme scales, 22 
homogeneous Basic Interest Area scales, and 91 Occupational scales. Administra- 
tive Indices and four Nonoccupational scales are also included. 

Although the procedures followed in developing the CAI were closely similar 
to those used in the Strong, all data gathering and statistical analyses were carried 
out independently for this inventory. Except for the General Theme scales, the 
particular scales developed in each category are thus specific to the CAI. For re- 
views of the CAI-VV, see Kehoe (1992) and Vacc (1992). 


Self-Directed’ Search (SDS). Another approach to the assessment of 
occupational interests is illustrated by the Self-Directed Search (SDS). This 
instrument was developed by Holland, whose hexagonal model of general 
occupational themes (discussed earlier in. this chapter) has attracted wide 
attention and has been incorporated in several current inventories (Holland, 
1985/1992; Holland, Fritzsche, & Powell, 1994; Holland & Gottfredson, 1976; 
Holland, Powell, & Fritzsche, 1994), 

As its title implies, the SDS was designed as a self-administered, self-scored, 
and self-interpreted vocational counseling instrument. Although organized 
around interests, the procedure also calls for self-ratings of abilities and reported 
competencies. The individual fills out the Self-Assessment Booklet, scores the re- 
sponses, and calculates six summary scores corresponding to the themes of the 
Holland model (Realistic, Investigative, Artistic, Social, Enterprising, and Con- 
ventional). The three highest summary scores are used to find a three-letter code. '? 
An accompanying booklet, the Occupations Finder, is employed to locate, among 
1,335 occupations, those whose codes resemble the respondent's summary code. 


ee versions of the CAT are currently available, namely, The Vocational Version (VV) and The Enhanced 
Versions (EV). The description in this section deals only with the VV. The EV, although very similar in structure, 
is a completely separate instrument (Johansson, 1986) that is applicable to a greater number and range of occu- 
pations, including many that require postsecon: gy education: 
P Although the SDS is designed to be self-scori S mu. 
"scoring, the manual recommends some supervision and checking o! 
diced t d S bad. de o oe ted individuals of various ages who took the current edition of the e 
EE lerived codes. containing. i l iti d, Powell, 
Fritzsche, 1994, p. 16). ining, incorrect letters or transpositions (Holland, Pow: 
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These occupations were chosen so as to represent nearly all of the workers in the 
United States; with a conversion code, the user can also explore all occupations 
in the Dictionary of Occupational Titles. Additional instructions, procedures, and. 
sources of information are provided to facilitate the individual's career decisions. 

The SDS is widely used in a variety of settings and has generated considerable 
research, both by the author and by independent investigators. In the years since 
its publication, it has undergone repeated revisions to simplify procedure and re- 
duce sex bias in career decisions. Its chief practical appeal stems from its brevity 
and simplicity, its do-it-yourself feature, and its role in expanding the individual’s 
career options. In addition to the regular form of the inventory (Form R), three 
other versions of the instrument are available: (1) Form E (Easy), developed for 
individuals with limited reading skills; (2) Form CP (Career Planning), designed 
for adults who are in the midst of career transitions; and (3) the Career Explorer 
version, aimed at middle school and junior high school students. 

With regard to the psychometric properties of the SDS, indices of reliability 
are generally: satisfactory for the summary scores. Construct validation of the 
basic six themes relies principally on the research that led to the formulation of 
the themes and on subsequent confirmatory factor analytic studies that are mostly 
supportive (see, e.g. Oosterveld, 1994). The concurrent validity and predictive 
efficiency of the SDS fluctuate depending on the makeup of samples in terms o 
age, sex, educational level, and distribution of types (Holland, Fritzsche, & Pow- 
ell, 1994). Griticisms of the SDS center around some of its scoring and interpre- 
tive procedures (M. H. Daniels, 1989; Manuele-Adkins, 1989). Nevertheless, the 
consensus of reviewers’ opinions is that this instrument provides a simple, inex- 
pensive, and relatively accurate way to explore vocations. A bibliography for the 
SDS and several informative booklets for use in conjunction with it are available 
from the publisher. 

J. L. Holland (1966, 1985/1992) aligns himself clearly with those who regard 
occupational preferences as the choice of a way of life—a choice that reflects the 
individual's self-concept and major personality characteristics. Each of Holland s 
occupational themes corresponds to a "type" or cluster of personality attributes. A 
given individual may be described in terms of one or more predominant types. 
The themes also correspond to model environments in terms of which different 
occupational settings may be characterized. These environments comprise not 
only physical features and work demands but also the kinds of persons with whom 
the individual works (co-workers, supervisors, customers, clients, students). Ac- 
cording to Holland, individuals seek environments that are congruent with their 
personality types; and such congruence enhances work satisfaction, job stability, 
and achievement. 

From a different perspective, Holland’s general approach. to the assessment of 
vocational interests is in line with certain developments in the psychology of ca- 
reer decisions. Super (1953, 1957, 1990) has repeatedly maintained that Xogar. 
tional choices are the pointe ee of ates There is a n ond 
research tradition dealing with persona ity differences among occupationa 
(see, e.g., Borgen; EC gett, Fo & Holland, 1984; Osipow, 1973, chap. 6; 
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Pietrofesa & Splete, 1975, chap. 4; Super & Bohn, 1970, chap. 5). Occupational 
choice often reflects the individual's basic emotional needs. And occupational ad- 
justment is a major component of general life adjustment (Tait, Padgett, & Bald- 
win, 1989). The assessment of vocational interests—and more specifically the 
identification of those occupational groups whose interests and attitudes an indi- 
vidual shares most closely—thus becomes a focal point in the understanding of 
different personalities.!^ 


SOME SIGNIFICANT TRENDS 


Inventory Development and Use. Among the clearest developments ap- 
parent in current interest measures are the merging of the two major theoretical 
positions in vocational psychology, as well as of the different approaches to 
inventory construction, and the cross-utilization of empirical data banks for 
interpretive purposes. More and more instruments are providing scores on both 
homogeneous, broad interest scales and specific occupational scales. The six oc- 
cupational themes of Holland's model have surfaced in an increasing number of 
recently developed or revised inventories. 

From a different angle, linkages are being established with empirical occupa- 
tional data available on large populations. This was illustrated by the use of 
Strong occupational data in the interpretation of results on the Jackson Voca- 
tional Interest Survey. It is also illustrated in the linkage of many current inven- 
tories, such as the Holland SDS and the SII, with the data provided in the 
Dictionary of Occupational Titles (U.S. Department of Labor Employment and 
Training Administration, 1991). These are promising developments which 
strengthen and enhance the usefulness of any single instrument. The effects 
should be beneficial, provided that linkages are established by sound psychomet- 
ric procedures and that steps are taken to guard against possible overgeneraliza- 
tion in score interpretation. 

Another characteristic of newly developed or recently revised interest inven- 
tories is the expansion of the occupational levels they cover. At the outset, inter- 
est inventories focused on professional careers, with a smattering of occupations 
requiring less than a college or professional school education. Although some 
early efforts were made to expand coverage downward (e.g., Clark, 1961), the re- 
sulting inventories were not widely used. In contrast, we now find the latest revi- 
sion of the Strong offering a substantial number of scales for vocational/technical 
occupations that do not require a college degree for entry. Furthermore, some 
more recently developed tools, such as the vocational version of the Career 
Assessment Inventory (Johansson, 1984) and the Career Directions Inventory 


"Waller, Lykken, and Tellegen (1995) report some pertinent findings concerning the interrelationships be- 
tween occupational interests, leisure time interests, and personality traits in a large sample of participants in the 
Minnesota Twin Registry. Their studies suggest that while personality traits clearly mediate the person- 
occupation fit, the three sets of characteristics can be conceptualized as distinct domains. 
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(Jackson, 1986), were designed principally for persons whose education does not 
extend beyond secondary or technical schools. Such developments probably re- 
flect, at least in part, a growing recognition of the importance of effective career 
choices at all occupational levels, as well as the crucial role of interests in success- 
ful and personally satisfying work experience in all types of jobs. 

Still another discernible trend has implications that extend beyond interest in- 
ventories to other kinds of testing. In a provocative look into the future of inter- 
est inventories, J. L. Holland (1986) referred to the growing recognition of such 
inventories as intervention techniques. What effects does the inventory have on 
the test taker? For example, it can support and strengthen existing vocational as- 
pirations in one person. In another, it can stimulate a comprehensive exploration 
of the world of work, with attention to hitherto unconsidered options. In still an- 
other, it can provide increased self-understanding. Such diversity of possible ef- 
fects may be reflected not only in individualized score interpretations, but also in 
the administration and construction of interest inventories. This may represent 
another likely application of computerized adaptive testing. 

In the 1990s, the psychology of vocational choice and career development has 
been enriched by the application of paradigms from the field of cognitive psychol- 
ogy (see, e.g., Peterson, Sampson, & Reardon, 1991). Viewing vocational decision 
making as a problem-solving activity, likely to recur through the lifespan, has 
underscored the need to provide individuals with the knowledge and information- 
processing skills required for arriving at optimal solutions. Many new instruments 
are embedded within comprehensive programs for career exploration that are a log- 
ical extension of Super's views on career development and of Holland's approach to 
vocational choice. It has become increasingly clear that the area of vocational 
behavior offers unparalleled opportunities for engaging individuals actively and 
directly in the use of the insights and tools of psychology (Borgen, 1991). 


Models of Occupations. Most of the vocational measures discussed in this 
chapter have either stemmed from Holland’s model of occupational themes (e.g., 
the CAI and SDS) or have incorporated it into their procedures to a greater or 
lesser extent (e.g the SII and KOIS). In fact, during the past two decades, 
Holland's theoretical postulations have played a preeminent role in stimulating 
research in the field of vocational psychology, not only in the United States but 
also in other parts of the world (see, e-g., Borgen, 1991; Lokan & Taylor, 1986). In 
an unusually comprehensive test of the model, Tracey and Rounds (1993) 
conducted a structural meta-analysis of the R-LA-S-E-C scales across a variety of 
instruments; this study, which used 104 U.S. samples, found good support for the 
model. However, a similar meta-analysis done subsequently by the same authors 
with international and U.S. ethnic minority samples did not produce as good a fit 
(Rounds & Tracey, 1996). Other cross-national and cross-ethnic investigations 
have produced mixed results (Fouad & Dancer, 1992; Hansen, 1987; Khan, Alvi, 
Shaukat, & Hussain, 1990; Swanson, 1992). Cross-cultural research on the RL 
A-S-E-C model, not surprisingly, suggests that it may not be equally applicable 
across cultures in all its aspects. 
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In any event, the simplicity of the R-I-A-S-E-C hexagon and the overwhelm- 
ing dominance it has exerted in research efforts over the past 20 years have led 
many investigators to suggest that it may be time for the field to move beyond 
itor, at least, to add more elements to it in order to increase its usefulness.'? Some 
have called for the development of new theoretical structures to account for other 
aspects of occupations, such as abilities and reinforcer preferences, while others 
would like to see additional variables and dimensions investigated (see, e.g., 
Dawis, 1992; Prediger & Vansickle, 1992; R. H. Schwartz, 1992). Holland himself 
has stated that research should be aimed at studying the uses and interpretations 
of interest inventories, and the effects they have on test takers, rather than at fur- 
ther replications of the hexagon across different samples (Holland & Gottfred- 
son, 1992). 

One of the most ambitious efforts to modify and expand Holland's model, 
using new and highly sophisticated data analysis techniques, including multidi- 
mensional scaling, has been carried out by Tracey and Rounds (1996). Their 
three-dimensional representation of vocational interests uses Holland's typology, 
and the two dimensions that presumably underlie it,!° as a point of departure for 
an expanded spherical model that can accommodate a flexible number of interest 
types, depending on the test taker's needs, as well as a prestige dimension. A recent 
issue of the Journal of Vocational Behavior is devoted especially to a presentation of 
this new model and to commentary on it by various investigators. The responses 
to Tracey and Rounds's work are generally quite positive and suggest that it is 
likely to serve as a catalyst for integration and further advances in the field of vo- 
cational theory and measurement (see Borgen & Donnay, 1996; Gonzalez, 1996; 
G. D. Gottfredson, 1996; Hansen, 1996; Harmon, 1996; Prediger, 1996). 

The search for the nature, number, and organization of basic interests parallels 
those conducted in order to identify the primary factors in the realms of ability 
and personality (discussed in chaps. 11 and 13)? In all three instances, the cate- 
gories uncovered from data analyses are a function of the specific variables and 
samples that are used. Moreover, such categories are descriptive, rather than ex- 
planatory, entities whose main value consists of their ability to simplify the col- 
lection and use of information for evaluating and predicting behavior. 


OPINION SURVEYS AND ATTITUDE SCALES 


Nature of Instruments. An attitude is often defined as a tendency to react 
favorably or unfavorably toward a designated class of stimuli, such as a national or 


We Spe ial issue of the Journal of Vocational Behavior (April 1992), which is the source of several of the refer- 

ences cited in this section, was wholly devoted to a consideration of Holland's theory. 

i These bipolar dimensions, identified by Prediger (1982) and widely used since, are People/Things and Data/ 
leas. 


1? p ; i T 2 
( For a wide-ranging and enlightening perspective on historical attempts to identify and classify interest dimen- 
sions, and to map their interrelationships, see Rounds (1995). 
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ethnic group, a custom, or an institution: Tt is evident that, when so defined, 
attitudes cannot be directly observed but must ibe inferred from overt behavior, 
both verbal and nonverbal. In more objective terms, the concept of attitude may 
be said to connote response consistency with regard to certain categories of stimuli. 
In actual practice, the term "attitude" has been most frequently associated with 
social stimuli and with emotionally toned responses. It also often involves value 
judgments. } 

Opinion is sometimes differentiated from attitude, but the proposed distinc- 
tions are neither consistent nor logically defensible. More often the two terms are 
used interchangeably, and they-will be so employed in this discussion. With re- 
gard to assessment methodology, however, opinion surveys are traditionally dis- 
tingifhed from attitude scales. Opinion surveys are characteristically concerned 
with replies to specific questions, which need not be related. The answers to such 
questions are kept separate rather than combined into a total score. An employee 
opinion survey, for example, might include questions about work schedules, rate 
of pay, fringe benefits, company cafeteria, and relations to supervisors; each of 
these items is included because of its intrinsic relevance to the improvement of 
employee relations. The replies to each question are separately tabulated in the 
effort to identify sources of employee satisfaction and dissatisfaction.? 

Attitude scales, op the other hand; typically yield a total score indicating the di- 
rection and intensity of the individual's attitude toward a company, group of peo- 
ple, policy, or other stimulus category: In the construction of an attitude scale, the 
different questions are designed to measure a single attitude or unidimensional 
variable, and some objective procedures are usually followed in the effort to ap- 
proach this goal. An employee attitude scale, for example, yields a single score 
showing the individual’s degree of job satisfaction or overall attitude toward the 


company: 


Major Types of Attitude Scales. ^ In all attitude scales, respondents indicate 
their agreement or disagreement with a series of statements about the object of 
the attitude. Special procedures have been devised in the effort to achieve 
unidimensionality: or homogeneity of items, equality of distances between scale 
units, and comparability of scores from scale to scale. The technical problems 
involved in constructing attitude scales have received extensive attention, arx 
the methodology has made notable theoretical and statistical advances. lt is 
beyond the scope of this text to discuss specialized scaling techniques, which now 
constitute a’ growing area of statistical method (Jones & Koehly, 1993; D. J. 
Mueller, 1986; Ostrom, Bond, Krosnick, & Sedikides, 1994; Procter, 1993; 
Reckase, 1990; Young, 1984). Nevertheless, we can briefly examine three major 


e scale construction that are commonly encountered in the 
ted by the 


approaches to attitud 
psychological testing literature. These approaches are represen 


Thurstone, Guttman, and Likert types of scales. 


18A multivolume series edited by Fink (1995) provides a comprehensive guide specifically devoted to the plan- 


ning, preparation, implementation, and analysis of surveys. 
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Thurstone’s adaptation of psychophysical methods to the quantification of 
judgment data represented an important milestone in attitude scale construction 
(Thurstone, 1959; Thurstone & Chave, 1929). By these procedures, Thurstone 
and his co-workers prepared about 20 scales for measuring attitudes toward war, 
capital punishment, the church, patriotism, censorship, and many other institu- 
tions, practices, issues, and national or ethnic groups. The development of a 
Thurstone-type scale begins with the assembling of many statements expressing a 
wide range of attitudes toward the object under consideration. A large number of 
judges are asked individually to sort the statements into piles (usually 11) for de- 
gree of favorableness. The judges do not indicate their own attitudes; they only 
classify the statements. The median position assigned to each statement by the 
judges is the scale value of that statement. The variability of the judgmfnts is 
taken as an index of its ambiguity, insofar as different judges assign the statement 
to different categories. Items are chosen so as to exhibit minimum variability and 
a wide spread of scale values, approximating equal spacing across the 11-point 
range. In the final attitude scale, the statements are presented in random order, 
with no indication of their scale values. The respondent's score is the median 
scale value of all the statements he or she endorses. 

The Guttman-type scale was originally developed as a technique for determin- 
ing whether a set of attitude statements is unidimensional (L. Guttman, 1944, 
1947). In Guttman’s sense, a perfect scale exists if a respondent who agrees with a 
certain statement of a particular attitude also agrees with milder statements of 
that attitude. In other words, such attitude scale items can be ordered along a 
continuum of intensity or difficulty of acceptance. Each person's position on the 
scale would thus completely determine her or his responses. If we know the most 
extreme statement an individual will accept, we should be able to reproduce all 
her or his responses. In actual practice, such reproducibility cannot be fully at- 
tained, because of errors of measurement in each response; it can only be approx- 
imated within certain limits. The essential procedure in the development of a 
Guttman scale is to identify a set of items that fall into an ordered sequence in 
terms of their endorsement by respondents. Items that do not fit this requirement 
are discarded. A person's score on a Guttman scale is found by examining the pat- 
terns of items that he or she endorses. It may be recalled that the same concept of 
ordinality, or uniform Progression of performance, underlies the Piagetian scales 
discussed in chapters 3 and 9. 

Because the construction of a Thurstone scale requires rather elaborate proce- 
dures, and the conditions of a Guttman scale are difficult to meet in practice, Lik- 
ert (1932) developed a type of scale that is easier to construct while yielding 
equally satisfactory reliability. The Likert-type scale begins with a series of state- 
ments, each of which expresses an attitude that is either clearly favorable or 
clearly unfavorable. Items are selected on the basis of the responses of persons to 
whom they are administered in the process of test construction. The principal 
basis for item selection is internal consistency, although external criteria are also 
employed when available. Likert scales call for a graded response to each state- 
ment. The response is usually expressed in terms of the following five categories: 
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strongly agree (SA), agree (A), undecided (U), disagree (D), and strongly dis- 
agree (SD). To score the scale, the response options are credited 5, 4, 3, 2, or 1 
from the favorable to the unfavorable end. For example, "strongly agree" with a 
favorable statement would receive a score of 5, as would “strongly disagree” with 
an unfavorable statement. The sum of the item credits represents the individual's 
total score, which must be interpreted in terms of empirically established norms. 

Most attitude scales have been developed for use in particular research pro- 
jects. Some were designed for the investigation of employee attitudes and morale. 
Others have been used to assess the outcome of educational and training pro- 
grams. Attitude scales may cont ibute to the evaluation of different instructional 
procedures designed to modify particular attitudes. Or they may be employed in 
measuring the changes in student attitudes toward literature, art, different ethnic 
and cultural groups, or social and economic problems following a given educa- 
tional program. One of the most extensive applications of attitude measurement 
is to be found in research in social psychology. Practically every textbook on so- 
cial psychology contains sections on attitudes and their measurement. Among 
the many problems investigated through attitude measurement may be men- 
tioned group differences in attitudes, the role of attitudes in intergroup relations, 
background factors associated with the development of attitudes, the interrela- 
tion of attitudes (including factor analyses and other multivariate analysis meth- 
ods), trends and temporal shifts in attitudes, and the experimental alteration of 
attitudes through interpolated experiences.!” Relatively few attitude scales have 
been published, although most are fully described in the research literature. An 
extensive collection of early attitude scales constructed for a variety of purposes 
was assembled in a book by M. E. Shaw and Wright (1967). Information about 
more recently developed measures of certain attitudes and values, such as alien- 
ation and anomie, self-esteem, and locus of control, can be found in Measures of 
Personality and Social Psychological ‘Attitudes, edited by Robinson, Shaver, and 
Wrightsman (1991). 


A Note on Gender-Related Variables and Measures. Previous editions 
of this book have included sections covering measures of sex roles and related 
concepts, such as masculinity, femininity, and androgyny, at some length, whereas 
the present one does not. Although research and instruments aimed at assessing 
these phenomena have continued to proliferate (see, e.g., Lenney, 1991), many 
investigators in the field agree that it is in a state of conceptual disarray. From the 
broader perspective of individual differences in gender-related variables, Betz 
(1995) has prepared an excellent update on the current state of knowledge 
regarding such variables. She concludes that the lack of a theoretical 
framework—and of substantive conceptual definitions—has caused progress in 
the study of gender-related phenomena to be limited. To this, it undoubtedly 


A 
19An excellent overview of the vast subspecialty of social psychology that deals with attitudes can be found in 
Eagly and Chaiken (1993). For a briefer review of the literature on attitudes and attitude change, see Olson an: 


Zanna (1993). 
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should be added that the past two decades have also witnessed unprecedented 
changes in the cultural views of gender in the United States and elsewhere 
around the world. At any rate, as Betz points out, global explanatory concepts 
for gender-related. differences, such as sex roles or masculinity-femininity, 
have largely remained unsupported by empirical findings. According to her, in 
order for further progress to occur, more work needs to be done on careful 
conceptualization and definition of constructs that can be placed within a 
meaningful and coherent theoretical framework. 


LOCUS OF CONTROL 


The construct described as "locus of control” first came into prominence with the 
publication of a monograph by Rotter (1966). In this publication, Rotter pre- 
sented the scale he had developed to assess the individual's generalized expectan- 
cies for internal versus external control of reinforcement (I-E Scale). This 
instrument was constructed within the context of social-learning theory. In ex- 
plaining its use, Botter wrote, "The effect of reinforcement following some behav- 
ior, . . is not a simple stamping-in process but depends upon whether or not the 
person perceives a causal relationship between his own behavior and the reward 
(1966, p. 1). Internal control refers to the perception of an event as contingent 
upon one’s behavior or one's relatively permanent characteristics. External con- 
trol, on the other hand, indicates that a positive or negative reinforcement fol- 
lowing some action of the individual is perceived as not being entirely contingent 
upon her or his own action but the result of chance, fate, or luck; or it may be per 
ceived as under the control of powerful others and unpredictable because of the 
complexity of forces surrounding the individual. 

The LE Scale is a forced-choice self-report inventory. Two illustrative items 
are given in Table 14-1. The complete list of items, together with standard in- 
structions for administering them, can be found in Botter: monograph (1966). 
Considerable information regarding the L-E Scale is presented in the original pub- 
lication, including percentile norms on several hundred male and female students 
from a single university, together with means and SDs of some dozen other sam- 
ples comprising mostly college groups. Data on many other groups have subse- 
quently accumulated from independent research projects. A substantial body of 
data on construct validity is also available. Factorial analyses initially indicated 
thata single general factor could account for most of the response variance. Other 
factor analyses, however, subsequently suggested that the construct may be subdi- 
vided into several distinguishable factors, as illustrated by a belief in a difficult 
world, an unjust world, an unpredictable world, and a politically unresponsive 
world (B. E. Collins, 1974). Several later studies likewise found multifactorial 
Structures. 

By the mid- to late 19705, it had been well established that control expectan- 
cies of the kind tapped by the LE Scale could play a meaningful role in helping to 
predict certain behaviors. However, investigators in the field also recognized that 


^ 


“Table 14-1 


Two Illustrative Items from the Internal-External (I-E) Scale 


d i. 


(a) In the long run, people get the respect. : 
(b) Unfortunately, an individual's: 


tries. dius S pa 


GC 


1 


(a) Becominga success ha matter of hard wor 
(b) Gettinga good job depends maint ] 


Instructions state: “This is a questionna find 


events in our society affect different people. . . . Ple 
(and only one) which you more strongly believe to be the : $ 


(From Rotter, 1966, p. 1 1. Reprinted by permission.) 


in order to maximize predictive accuracy, measures of control expectancies 
needed to be targeted to the specific populations and areas of behavior under 
study. Since then, several diverse locus ‘of control scales have been developed. 
Some are tailored for use with different populations, including school-age an 
preschool children (see, e.g» Connell, 1985; Herzberger, Linney, Seidman, & 
Rappaport, 1979; Nowicki & Duke, 1983). Others are aimed at assessing control 
beliefs with regard to specific realms, such as marital satisfaction or mental heal 
status (D. J. Hill & Bale, 1980; P. C. Miller, Lefcourt, & Ware, 1983). Still others 
encompass causal beliefs concerning different areas (ego. achievement and affilia- 
tion) or different spheres of control, such as personal efficacy, interpersonal con- 
trol, and sociopolitical control (Lefcourt, von Baeyer, Ware, & Cox, 1979; 
Paulhus, 1983). Reproductions of several available scales, with basic psychomet- 
ric data, can be found in Lefcourt (1991). ` i z 

The number and range of available instruments to assess the locus of control 
construct attest to its continued vitality, More than 5,000 entries are listed under 
this topic in the psychological literature PsycINFO database for the years 1984 to 
1995. Investigations of the role of control expectancies in health-related prac- 
tices are especially numerous. In addition, locus of control is an important aspect 
of motivation in general and is closely related to other key areas of personality re- 


search, including attribution theory, learned helplessness, and self-efficacy, among 


many others (Skinner, 1995). 


Projective Techniques 


he available supply of projective techniques is large and diversified. In this 
chapter, we consider the major varieties of such techniques, together with 
some well-known examples. Except for special points peculiar to particular 
techniques, no critical evaluation of individual instruments will be under- 
taken. Instead, a summary evaluation of projective techniques will be given in a 
separate section, with emphasis on common methodological problems. Projective 
techniques present a curious discrepancy between research and practice. When 
evaluated as psychometric instruments, the large majority make a poor showing. 
Yet their popularity in clinical use continues unabated (Bellak, 1992; Lubin, 
Larsen, & Matarazzo, 1984; Piotrowski, 1984; Piotrowski, Sherry, & Keller, 1985; 
Piotrowski & Zalewski, 1993; Watkins, 1991). The nature and implications of 
this inconsistency will be examined in the last section. 

The literature on projective techniques is vast, running to over 6,000 refer- 
ences on a single instrument. For a broader coverage of available projective tech- 
niques, the reader is referred to Klopfer and Taulbee (1976), Rabin (1981, 1986), 
and Reynolds and Kamphaus (1990b, chaps. 3-8). The Mental Measurements 
Yearbooks contain reviews of most current instruments. The Journal of Personality 
Assessment,! while open to articles on all types of tests, publishes a wealth of ma- 
terial on various aspects of the use of projective techniques. 


LEE 
! This publication has undergone several title changes since it was started by Bruno Klopfer in 1936 as the 
Rorschach Research Exchange newsletter; from 1950 to 1963 it was called the Journal of Projective Techniques. 
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NATURE OF PROJECTIVE TECHNIQUES 


A major distinguishing feature of projective techniques is to be found in their as- 
signment of a relatively unstructured task, that is, a task that permits an almost un- 
limited variety of possible responses. In order to allow free play to the individual's 
fantasy, only brief, general instructions are provided. For the same reason, the test 
stimuli are usually vague or ambiguous. The underlying hypothesis is that the way 
in which the individual perceives and interprets the test material or "structures" 
the situation will reflect fundamental aspects of her or his psychological function- 
ing. In other words, it is expected that the test materials will serve as a sort of 
screen on which respondents *project" their characteristic thought processes, 
needs, anxieties, and conflicts. 

Typically, projective instruments also represent disguised testing procedures, in- 
sofar as test takers are rarely aware of the type of psychological interpretation that 
will be made of their responses: Projective techniques are likewise characterized 
by a global approach to the appraisal of personality. Attention is focused on a com- 
posite picture of the whole personality, rather than on the measurement of sepa- 
rate traits. Finally, projective techniques are usually regarded by their exponents 
as especially effective in revealing covert, latent, or unconscious aspects of personal- 
ity. Moreover, the more unstructured the test, it is argued, the more sensitive it is 
to such covert material. This follows from the assumption that the more unstruc- 
tured or ambiguous the stimuli, the less likely they are to evoke defensive reac- 
tions on the part of the respondent. 

Projective methods originated within a clinical setting and have remained pre- 
dominantly a tool for the clinician. Some have evolved from therapeutic proce- 
dures (such as art therapy) employed with psychiatric patients. In their 
theoretical framework, most projective techniques reflect the influence of tradi- 
tional and modern psychoanalytic concepts: There have also been scattered at- 
tempts to lay a foundation for projective techniques in stimulus-response theory 
and in perceptual theories of personality (see, €-8» Lindzey, 1961/1977). It should 
be noted, of course, that the specific techniques need not be evaluated in the light 
of their particular theoretical slants or historical origins. A procedure may prove 
to be practically useful or empirically valid for reasons other than those initially 


cited to justify its introduction. 


The Rorschach. One of the most popular projective techniques is that em- 
ploying the Rorschach inkblots (Aronow & Reznikoff, 1983; Aronow, Reznikoff, 
& Moreland, 1994; Erdberg & Exner, 1984; Exner, 1993). Developed by the Swiss 
psychiatrist Hermann Rorschach (1921/ 1942), this technique was first described 
in 1921. Although standardized series of inkblots had previously been used by 
psychologists in studies of imagination and other functions, Rorschach was the 
blots to the diagnostic investigation of the personality as à whole. 
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In the development of this technique, Rorschach experimented with a large 
number of inkblots, which he administered to different psychiatric groups. As a 
result of such clinical observations, those response characteristics that differenti- 
ated between the various psychiatric syndromes were gradually incorporated into 
a scoring system. The scoring procedures were further sharpened by supplemen- 
tary testing of mentally retarded and normal persons, as well as artists, scholars, 
and other distinct groups of people. Rorschach’s methodology thus represented an 
early, informal, and relatively subjective application of criterion keying. 

Due to Rorschach’s untimely death, in 1922, the development of what was in 
effect a test in progress was pursued by his colleagues and students. In the ensuing 
decades, the use of Rorschach’s technique expanded greatly both in Europe and in 
the United States. However, in the absence of a single systematizer, the proce- 
dures for administering, scoring, and interpreting “the Rorschach” proliferated 
and evolved into several methods or systems.” By the 1960s, to speak of it as a sin- 
gle, standardized test was, in fact, inaccurate. The various systems and users 
shared only the 10 original stimulus cards and some basic interpretive postulates 
derived from Rorschach’s original work. 

Each of the Rorschach cards has printed on it a bilaterally symmetrical inkblot 
similar to the one illustrated in Figure 15-1. Five of the blots are executed in 
shades of gray and black only; two contain additional touches of bright red; and 
the remaining three combine several pastel shades. Typically, during administra- 
tion of the Rorchach, the respondent is shown each inkblot, one at a time, and 
asked to tell what the blot could represent. Besides keeping a verbatim record of 
the responses to each card, examiners usually note reaction time and duration of 
responses, position ot positions in which the cards are held, spontaneous remarks, 
emotional expressions, and other incidental behavior of the respondent during 
the test session. At some point following the presentation of the 10 cards, most 
examiners question the individual systematically regarding the parts and aspects 
of each blot to which the associations were given. During this inquiry, the respon- 
dents also have an opportunity to elaborate and clarify on their earlier responses. 

The major differences among the various Rorschach systems that flourished 
from the 1930s to the 1960s were in their scoring methods and, consequently, on 
interpretive matters. Basically, the focus of concern for Rorschach interpretation 
may be placed either on the content of the responses or on their formal character: 
istics, such as location, determinants, form quality, and the various quantitative 
summaries derived from the responses. Although the Rorschach systems differed a 
great deal in the details of scoring and interpreting responses to the cards, many of 
them did share the basic classification of scoring categories. Location refers to the 
part of the blot with which the respondent associates each response. Does he or 
she use the whole blot, a common detail, an unusual detail, white space, or some 
combination of these areas? The determinants of the response include form, color, 
shading, and “movement.” Although there is of course no movement in the blot 


?For a more thorough history of the Rorschach and its evolution, see Exner (1969, 1993). 
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Figure 15-1. An Inkblot of the Type Employed in the Rorschach Technique. 


itself, the respondent's perception of the blot as a representation of a moving ob- 
ject is scored in this category. Further differentiations are made within these cate- 
gories. For example, human movement, animal movement, and abstract or 
inanimate movement are separately scored. Similarly, shading may be perceived 
as representing depth, texture, hazy forms such as clouds, or achromatic reproduc- 
tions of colors as in a photograph. The form quality or form level of responses may 
refer to the precision with which they match the location used, to their original- 
ity, or to both. In addition, the cognitive complexity of responses and other qual- 
itative aspects of the percepts may also be scored in some systems. 

The treatment of content also has varied from one Rorschach system to an- 
other, although certain major categories are regularly employed. Chief among 
these are human figures, human details (or parts of human figures), animal figures, 
and animal details. Other broad scoring categories may include art objects, plants, 
maps, clouds, blood, X-rays, clothing, sexual percepts, and landscapes. A popular- 
ity score is often found on the basis of the relative frequency of different responses 
among people in general. For each of the 10 cards, certain specific responses are 
scored as popular because of their common occurrence. In addition, most systems 
include a tally of unusual or deviant verbalizations made by the test taker during 
the Rorschach; such verbalizations are particularly useful in the detection of se- 
vere forms of psychopathology: 

Further analysis of Rorschach responses typically has been based on the rela- 
tive frequency of responses in the various categories as well as on certain ratios 


and interrelations among different categories. Examples of the sort of qualitative 


interpretations that have commonly been used with Rorschach responses include 
tual thinking, of "color" re- 


the association of “whole” responses with conceptual 

, Deus cen 
sponses with emotionality, and of “human movement" responses with imagina- 
tion and fantasy life. In the usual application of the Rorschach, major emphasis is 
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placed on the final “global” description of the individual, in which the clinician 
integrates results from different parts of the protocol and takes into account the 
interrelations of different scores and indices. In actual practice, information de- 
tived from outside sources, such as other tests, interviews, and case history 
records, is also used in preparing these descriptions. 


Exner’s Comprehensive System. By the 1960s, the Rorschach had fallen 
into disrepute as a psychometric instrument. Researchers found themselves 
stymied by the difficulties inherent in the method itself, such as the variability in 
the total number of responses, the influence of examiner effects, and the 
interdependence of scores, as well as by the proliferation of scoring systems. These 
circumstances made the investigation of the reliability and validity of the 
Rorschach a piecemeal enterprise, fraught with methological flaws, and ultimate- 
ly disappointing in terms of results. Many clinical psychologists continued to use 
the Rorschach regularly, but the-majority of them admitted that they did not 
follow a single system faithfully. Instead, they tended to use the Rorschach data in 
their own preferred way, which could range from totally impressionistic, qualita- 
tive interpretation to adherence to one or more of the systems, more or less strin- 
gently, as they saw fit. 

The vast differences that had developed among the five major Rorschach sys- 
tems in use in the United States were documented by John E. Exner, Jr. (1969) 
who had worked with Samuel Beck and Bruno Klopfer, two of the most diverse 
Rorschach systematizers.? Through his extensive investigations of the clinical use 
of the Rorschach and of its research literature, Exner became intrigued by the 
possibility of distilling into a single system all of the empirically defensible and 
useful features that the method might possess. For the past quarter century, he has 
led the most ambitious and fruitful effort ever mounted to put the Rorschach ona 
GE sound basis (Exner, 1974, 1991, 1993, 1995; Exner & Weiner, 

First, Exner developed a comprehensive Rorschach system that integrated ele- 
ments culled from the five major approaches. In this Comprehensive System, 
Exner provides standardized administration, scoring, and interpretative proce- 
dures selected on the basis of empirical comparisons among various practices. The 
emphasis is on structural rather than on content variables. In fact, according to 
Exner, the object of scoring responses is the derivation of the structural summary 
which is at the core of his system and provides the basis for most interpretive pos- 
tulates. Each response is coded on several different scoring categories, including 
location, determinants, form quality, contents, organizational activity, and popu- 
larity, among others. The coded responses are listed and the frequencies of the 
codes are tallied; these elements are then used in the calculation of the ratios, 
percentages, and indices that complete the structural summary. Interpretative 
statements can be derived, in light of the entire Rorschach record, from variables 


>The other three major systems were those of Marguerite Hertz, Zygmunt Piotrowski, and David Rapaport with 
Roy Schafer. i 
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at various levels of complexity. Some hypotheses are linked to simple frequencies, 
like the extent to which a single determinant (e.g., shading) is used; others are 
based on the combined occurrence of two or more variables, such as the number 
of human and animal contents. The most complex level of analysis involves con- 
stellations of several variables and empirically derived cutoff scores. These vari- 
ables are grouped into indices (e.g., Schizophrenia Index, Depression Index; and 
Coping Deficit Index) which presumably reflect the likelihood that certain disor- 
ders or conditions are present. : 

Using this uniform system, which has evolved and been refined over the past 
two decades, Exner and. his associates have collected a considerable body of psy- 
chometric data, including: norms on adults, children, and adolescents as well as 
various psychiatric. reference samples. Studies of retest reliability over several 
time intervals, ranging from a few days to three years, indicate considerable tem- 
poral stability for most of the scored variables. The careful delineation of scoring 
guidelines for the Comprehensive System has made it possible for trained exam- 
iners to obtain fairly high rates of interscorer agreement. In fact, one of the major 
contributions of Exner's work is the provision of a uniform Rorschach system that 
permits comparability among the research findings of different investigators. It is 
therefore not surprising that the Comprehensive System has become the most fre- 
quently taught approach. to scoring and interpreting the Rorschach and has 
proved to be of value in increasing the statistical power of Rorschach research 
(see, e.g., Acklin, McDowell, & Orndoff, 1992; Ritzler & Alter, 1986). 

In spite of the clear methodological: improvements that Exner's system has 
brought to the Rorschach, several: important questions remain. The chief and 
most complicated of these is the issue of validity. The literature bearing on this 
question is enormous and still full of contradictory findings. The Rorschach, like 
the MMPI, has been used for a wide array of purposes, many of which go well be- 
yond the original intent of its author; and this multiplicity of uses adds to the 
complexity of investigating its validity. In general, meta-analytic research has 
shown that indices of convergent validity for the Rorschach are comparable to 
those obtained. with the MMPI (Atkinson, Quarrington, Alp, & Cyr 1986; 
K. C. H. Parker, Hanson, & Hunsley, 1988). In addition, Exner’s own extensive 
work with the Rorschach lends substantial support to the validity of many con- 
structs assessed through his system, and to its usefulness in describing certain as- 
pects of personality functioning. However, when it comes to diagnosing complex 
current conditions ot predicting future behavior, the research results have been 
mixed (Exner, 1996; Weiner, 1994a; Wood, Nezworski, & Stejskal, 1996a, 1996b). 


A major complicating factor in the interpretation of Rorschach scores is the 
ivity, or R. When large 


total number of responses—known as response product 


Ge poi 1 
4Computer-based scoring and interpretation assistance programs: for the Comprehensive System are available 


and require only the initial coding of responses- 

5However, the issue-of how best to quantify interrater agreement on 
cussion of this problem and a comparison of three different met 
Rorschach data, see McDowell and Acklin (1996). 


the Rorschach is not yet settled, For a dis- 
thods of estimating scorer reliability for 
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differences in R occur between individuals or groups, differences in other scoring 
categories are likely to follow. Thus, the differences found in certain categories 
may be only an artifact resulting from variations in R. To this intrinsic character- 
istic of Rorschach scores may be added the fact that response productivity appears 
to be related to other variables, such as intellectual level and amount of educa- 
gon D Characteristically, Rorschach investigators differ in their views of the sig- 
nificance and extent of the problems presented by variations in response 
productivity. G. J. Meyer (1992, 1993), for example, has called for further investi- 
gation of the psychological meaning of R and of the merits versus disadvantages 
of controlling or adjusting for this variable. Others consider that its impact has al- 
ready been shown to be insignificant in most circumstances (Exner, 1992; 
Weiner, 1995b). Still others maintain that the problems posed by R need to be 
handled in different ways depending on the use that is to be made of the data and, 
in the case of research data, depending on how much the distributions depart 
from normality (Kinder, 1992; Lipgar, 1992). 

The Comprehensive System developed by Exner is not without its critics. 
Many Rorschach users have objected to the atheoretical nature of Exner’s ap- 
proach and to its underutilization of content data—which, they contend, seri- 
ously reduces its clinical utility. Critics have also noted that the system is 
exceedingly complex, as well as occasionally vague and contradictory. Moreover, 
Exner'5 own research has been faulted for its small sample sizes, large numbers of 
variables, dearth of cross-validation studies, and inaccesibility to public scrutiny.’ 
Be that as it may, the fact remains that the availability of Exner's system, together 
with the research data he and his associates have compiled, has injected new life 
into the Rorschach as a psychometric instrument. 


Alternative Approaches. In spite of the wide acceptance that Exners 
Comprehensive System has achieved, there are still a number of contrasting 
approaches to the Rorschach that coexist with it. In fact, many have credited 
Exner's work with reviving interest in the Rorschach from these other perspec- 
tives as well as from his own. One alternative, more strongly clinical in its orien- 
tation, is described by Aronow and his colleagues (Aronow & Reznikoff, 1976, 
1983; Aronow et al., 1994, 1995). This approach treats the Rorschach essentially 
as a standardized clinical interview that samples a person's perceptual operations. 
It focuses on the interpretation of content, rather than on structural variables or 
perceptual determinants of responses. Nevertheless, available content scales and 
Scoring systems are not considered sufficiently dependable as psychometric 
instruments for use in individual diagnosis. Rather, these authors recommend a 


E e related eeh) the research literature on the Rorschach's susceptibility to malin- 
» is ien subjects are asked to fake n ax X AS É 
ductivity (G. G. Perry & Kinder, 1990). psychosis on the Rorschach, there is a reduction in response pro: 
"For a sampling of these and other criticisms of Exner's work, ^ 

see Aronow, Reznikoff, and Moreland (1995), 
Kleiger (1992), P. M. Lerner (1994), W. Perry (1993), Vigli 1989), Vi d H d Wood, 
Nezworski, and Stejskal (1996a, 1996b). iglione ( ), Vincent and Harman (1991), an 
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strictly clinical application of the Rorschach as a means of enhancing the idio- 
graphic understanding of the individual case—and they observe that most experi- 
enced clinicians gravitate toward this approach because of its usefulness in the 
process of psychotherapy. Their interpretations rely principally on the content of 
responses, supplemented by verbal and nonverbal behavior. On the basis of 
available’ research and clinical experience, Aronow and his coworkers: have 
prepared a set of guideliness for more effective and dependable idiographic 
interpretations. For example, they suggest that responses which depart from the 
commonplace and those less closely bound to the stimulus properties of particular 
blots are more likely tobe significant in the individual case. Similarly, the authors 
caution against rigid systems of symbol interpretation that attach fixed meanings 
to content categories or assign invariant evocative qualities to the Rorschach 
blots. Instead; they offer procedures for investigating the meaning of responses in 
a conservative fashion that is compatible with. general psychodynamic principles 
and makes use of the individual's own experiential history. 

Another alternative current approach to the Rorschach is the one espoused by 
Paul Lerner (1991). This work stands in sharp contrast to Exner's atheoretical 
stance in that it is steeped in modern psychoanalytic theory, as it has evolved 
since the 1970s. Whereas Exner (1989) contends that the Rorschach is a test in 
which projection rarely comes into play, Lerner considers it to be fundamentally a 
projective method for accessing the internal. world of the individual? Lerner's 
text provides a guide to the clinical uses of the Rorschach and to its research ap- 
plications in the assessment of object representations, defensive maneuvers, and 
other concepts that are central to modern psychodynamic theories. 

A special clinical application of the Rorschach is illustrated by the consensus 
Rorschach (Aronow'et al., 1994, chap. 13; Blanchard, 1968; Cutter & Farberow, 
1970). In this adaptation; the inkblots are presented for joint interpretation by 
married couples or other family members, coworkers, juvenile gang members, or 
other natural groups: Through discussion and negotiation, the participants must 
reach agreement on a single, common set of responses. The technique has been 
used, with apparent success, as a basis for exploring interpersonal relationships 
and various other kinds of social behavior. ; 

The Rorschach has been described, accurately, as "a test that has repeatedly 
outlived its obituaries" (Peterson, 1994, p. 396). Its demise has been predicted 
many times because, like all of the most frequently used psychological tests, the 
Rorschach has been among the most misused as well. Currently, the technique is 
experiencing a tremendous resurgence both in terms of research activity and E 
orizing; although different ways of approaching the Rorschach I users of a 
types appear convinced that it has special value in studying perceptual, cognitive, 


and affective aspects of personality functioning. Some are satished viewing the 


Rorschach simply as a method for generating data that can be used from various 


The reasons for this apparent contradiction are rooted in the divergent views of these authors concerning the 
Jc on the Rorschach. 


nature of projection and of the examinee's tas! 
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perspectives, while others are engaged in efforts to link the empirical and theoret- 
ical traditions as well as different theoretical approaches to the Rorschach into a 
totally integrated system (see, e.g., Acklin, 1995; Blatt, 1990; P. M. Lerner, 1994; 
Meloy & Singer, 1991; Weiner, 1994b; Willock, 1992). 


The Holtzman Inkblot Technique. | Even before Exner began to work on his 
comprehensive Rorschach system, a serious effort to apply a psychometric orien- 
tation to the inkblot technique was undertaken by Wayne H. Holtzman. Modeled 
after the Rorschach, the Holtzman Inkblot Technique (HIT) was so designed as 
to eliminate the principal technical deficiencies of the earlier instrument (Holtzman, 
1961, 1986; Holtzman, Thorpe, Swartz, & Herron, 1961). The changes in stimulus 
materials and procedures are sufficiently extensive, however, to require that the 
Holtzman be regarded as a different test and evaluated without reference to the 
Rorschach. The Holtzman technique provides two parallel series of 45 cards each; 
the blots were selected from a large preliminary pool on the basis of empirical 
ctiteria aimed at maximizing their effectiveness. Only one response per card is 
obtained. Both achromatic and colored cards are included; a few inkblots are 
markedly asymmetric. ' 

Administration and scoring of the HIT were well standardized and clearly de- 
sctibed from the outset. Scores are obtained in 22 response variables, including 
many that parallel those of the Rorschach and some additional variables, such as 
anxiety and hostility. For each variable, percentile scores are available for normal 
samples of children and adults and for a number of deviant groups (E. F. Hill, 
1972; Holtzman, 1975). Scorer reliability appears to be highly satisfactory. Split- 
half, alternate-form, and test-retest reliability investigations have shown differ- 
ences across the response variables, though most of the results are encouraging. A 
group form of the test, using slides, yields scores on most variables that are compa- 
rable to those obtained through individual administration (Holtzman, Moseley, 
Reinehr, & Abbott, 1963; Swartz & Holtzman, 1963). The HIT 25, a short ver- 
sion that consists of the first 25 cards from Form A of the HIT with 2 responses 
per card, has been proposed recently by Holtzman (1988) and is in the process of 
being normed (Swartz, 1992). 

Considerable validity data on the HIT have been accumulated, mostly with 
quite promising results (Gamble, 1972; Holtzman, 1975, 1986, 1988; Leichsen- 
ting, 1991; Sacchi & Richaud de Minzi, 1989; Swartz, 1973). The validation re- 
search has followed a variety of approaches, including the study of developmental 
trends, cross-cultural comparisons, correlations with other tests and with behav- 
ioral indicators of personality characteristics, and contrasted group comparisons 
with both normal respondents and psychiatric patients. A handbook prepared by 
E. E Hill (1972) is directed particularly to the clinical use of the HIT. 

It is apparent that the HIT has some psychometric advantages over the 
Rorschach. The availability of parallel forms permits not only the measurement 
of retest reliability but also adequate follow-up studies. The restriction of re- 
sponses to one per card holds response productivity (R) constant for each respon- 
dent, thus avoiding many of the pitfalls of Rorschach scoring. It should be noted, 
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however, that response length (number of words) is still uncontrolled and, as in 
the Rorschach, ‘has proved to be significantly related to several HIT scores 
(Megargee, 1966). In spite of its advantages, however, there is a relative paucity 
of information on the HIT compared to the Rorschach, and more data are needed 
to establish the diagnostic significance of the various scores and the construct va- 
lidity of the personality variables assessed by this technique (for reviews, see 
Cundick, 1985; Dush, 1985). 


PICTORIAL TECHNIQUES 


Thematic Apperception Test. In contrast to inkblot techniques, the 
Thematic Apperception Test (TAT) presents more highly structured stimuli and 
requires more complex and. meaningfully organized verbal responses. Interpreta- 
tion of responses by the examiner is usually based on content analysis of a rather 
qualitative nature. First developed by Henry Murray and his staff at the Harvard 
Psychological: Clinic (Murray et al., 1938), the TAT has not only been widely 
used in clinical practice and research, but it has also served as.a model for the 
development of many other instruments (J. W. Atkinson, 1958; Bellak, 1993; 
Dana, 1996b; R. Harrison, 1965; Holmstrom, Silber, & Karp, 1990; Klopfer & 
Taulbee, 1976, pp- 554-558; Obrzut & Boliek, 1986). 

The TAT materials consist of 19 cards containing vague pictures in black and 
white and one blank card.2 The respondent is asked to make up a story to fit 
each picture, telling what led up to the event shown in the picture, describing 
what is happening at the moment and what the characters are feeling and 
thinking, and giving the outcome. In the case of the blank card, the respondent 
is instructed to imagine some picture on the card, describe it, and then tell.a 
story about it. The original procedure outlined by Murray in the test manual re- 
quires two one-hour sessions, 10 cards being employed during each session. The 
cards reserved for the second session were deliberately chosen to be more un- 
usual, dramatic, and. bizarre, and the accompanying instructions urge respon- 
dents to give free play to their imagination. Four overlapping sets of 20 cards are 
available—for. boys, girls, males over 14, and females over 14. Most clinicians 
use abridged sets of specially selected cards, seldom giving more than 10 cards to 
a single respondent. 


In the original method of interpreting TAT stories (Murray et al., 1943), the 


examiner first determines who is the "hero," the character of either sex with 


whom the respondent has presumably identified herself or himself. The content 
list of "needs 


of the stories is then analyzed principally in reference to Murray? list | 
and "press." Some of the proposed needs were described in chapter 13, in connec- 
tion with the Edwards Personal Preference Schedule. Examples include achieve- 
ment, affiliation, and aggression. Press refers to environmental forces that may 


of the origin of the TAT images, see W. G. Morgan (1995). 
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For an intriguing historical account 
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facilitate or interfere with the satisfaction of needs. Being attacked or criticized, 
receiving affection, being comforted, and exposure to physical danger are illustra- 
tions of press. In assessing the importance or strength of a particular need or press 
for the individual, special attention is given to the intensity, duration, and fre- 
quency of its occurrence in different stories, as wells to the uniqueness of its as- 
sociation with a given picture. The assumption is made that unusual material, 
which departs from the common responses to each picture, is more likely to have 
significance for the individual. 

A fair amount of normative information has been published regarding the most 
frequent response characteristics for each card, including the way each card is per- 
ceived, the themes developed, the roles ascribed to the characters, emotional 
tones expressed, speed of responses, length of stories, and the like (J. W. Atkin- 
son, 1958; W. E. Henry, 1956; Murstein, 1972). Although these normative data 
provide a general framework for interpreting individual responses, most clinicians 
rely heavily on "subjective norms" built up through their own experience with 
the test and on the knowledge they have acquired about the examinee through 
other means. A number of quantitative scoring schemes and rating scales have 
been developed that yield good scorer reliability. Since their application is rather 
time-consuming, however, such scoring procedures are seldom used in clinical 
practice. Although typically given as an individual oral test in the clinical situa- 
tion, the TAT may also be administered in writing and as a group test. 

The TAT has been used extensively in personality research. Unfortunately, the 
wide diversity of administration and scoring procedures, and even of stimulus ma- 
terials, associated with the TAT rubric has extended to research uses as well as 
clinical practice (Keiser & Prather, 1990). This diversity has made it very difficult 
to investigate the psychometric properties of “the TAT” as a distinct psychologi- 
cal test because the method lacks the basic requirement of uniformity. In addi- 
tion, a considerable body of experimental data is available to show that such 
conditions as hunger, sleep deprivation, and social frustration significantly affect 
TAT responses (J. W. Atkinson, 1958). While supporting the projective hypothe- 
sis, the sensitivity of the TAT to such temporary conditions may complicate the 
meaning of responses. The question of internal consistency of TAT responses has 
also received attention (J. W. Atkinson, 1981; Entwisle, 1972). And some control 
should be applied for length of stories, or productivity—a problem that the TAT 
shares with the Rorschach (J. W. Atkinson & Raynor, 1974, chap. 3). 

Nevertheless, the value of thematic apperception techniques, in general, and 
of the TAT, in particular, is not in question. Recent investigations confirm the 
clinical utility of various versions of the TAT both for traditional applications, 
such as assessing extent of psychopathology and use of defense mechanisms, and 
for novel uses, such as the evaluation of problem-solving skills (Cramer & Blatt, 
1990; Hibbard et al., 1994; Ronan, Colavito, & Hammontree, 1993; Ronan, 
Date, & Weisbrod, 1995). One of the most promising applications of the TAT is 
with newly developed scales for the clinical assessment of object relations (Al- 
varado, 1994; Barends, Westen, Leigh, Silbert, & Byers, 1990; Freedenfeld, Orn- 
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duff, & Kelsey, 1995; Westen, 1991; Westen, Lohr, Silk, Gold, & Kerber 1990). 
Nor is the usefulness of the TAT confined to the thematic analysis of responses; 
formal characteristics of both the structure and content of TAT stories may also 
be fruitfully employed in the study of individuals and groups (see, e.g., Cramer, 
1996; McGrew & Teglasi, 1990; Teglasi, 1993). 


Adaptations of the TAT and Related Tests- Many adaptations of the TAT 
have been developed for special purposes. These exhibit varying degrees of 
resemblance to the original. Where to draw the line between the modified ver- 
sions of the TAT and new tests based on the same general approach as the TAT is 
arbitrary. Several versions of the TAT have been prepared for use in surveys of 
attitudes toward labor problems, minority groups, authority, and the like (D. T. 
Campbell, 1950; R. Harrison, 1965). Other adaptations have been developed for 
use in career counseling, executive appraisal, and a wide variety of research 
projects. Forms have also been constructed for special populations, including 
preschool children, elementary school children, disabled children, adolescents, 
and various national and ethnic groups (R. Harrison, 1965). 

Some TAT adaptations have focused on the intensive measurement of a single 
need or drive, such as sex or aggression: Of special interest is the extensive re- 
search on the achievement need (n-Ach) conducted over some thirty years by 
McClelland; Atkinson, and their associates (J. W. Atkinson, 1958; J. W. Atkinson 
& Feather, 1966; J. W. Atkinson & Raynor, 1974; McClelland, 1985; McClelland, 
Atkinson; Clark, & Lowell, 1953/ 1976). To measure n-Ach, four pictures were 
employed, two of which were taken from the TAT. Detailed schemes have been 
developed for scoring the resulting stories with regard to expressions of n-Ach. 
This technique has been used in an extensive program of research on achieve- 
ment motivation. The problems investigated range from basic motivation theory 
(J. W. Atkinson & Feather, 1966) to the social origins and consequences of n-Ach 
and its role in the riseand fall of societies (McClelland, 1961/1976). Meta- 
analyses of studies comparing TAT and questionnaire measures of n-Ach suggest 
that these two methods are both valid, albeit for different purposes and different 
aspects of the assessment of achievement drive (Spangler, 1992). 

A compendium of the scoring systems used in the content analysis of verbal 
material has been prepared by Charles Smith in association with John W Atkin- 
son, David C. McClelland, and Joseph Veroff (1992). Included are scoring sys- 
tems with research traditions of long standing (e.g., those for the achievement, 
affiliation, and power motives), and many others dealing with topics as diverse as 
political ideology and coping capacity. Conceptual matters as well as method- 
ological considerations in the sampling, scoring, and analysis of verbal material 
are discussed. Although many of the systems described are based on modifications 
of the TAT and reflect Murray's views, several other theoretical perspectives are 
also represented. Moreover, the systems that are included are meant for the analy- 
sis of the manifest rather than symbolic content of thought samples—as opposed 
to “projections” —and for research rather than clinical use. 
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facilitate or interfere with the satisfaction of needs. Being attacked or criticized, 
receiving affection, being comforted, and exposure to physical danger are illustra- 
tions of press. In assessing the importance or strength of a particular need or press 
for the individual, special attention is given to the intensity, duration, and fre- 
quency of its occurrence in different stories, as well as to the uniqueness of its as- 
sociation with a given picture. The assumption is made that unusual material, 
which departs from the common responses to each picture, is more likely to have 
significance for the individual. 

A fair amount of normative information has been published regarding the most 
frequent response characteristics for each card, including the way each card is per- 
ceived, the themes developed, the roles ascribed to the characters, emotional 
tones expressed, speed of responses, length of stories, and the like (J. W. Atkin- 
son, 1958; W. E. Henry, 1956; Murstein, 1972). Although these normative data 
provide a general framework for interpreting individual responses, most clinicians 
rely heavily on "subjective norms" built up through their own experience with 
the test and on the knowledge they have acquired about the examinee through 
other means. A number of quantitative scoring schemes and rating scales have 
been developed that yield good scorer reliability. Since their application is rather 
time-consuming, however, such scoring procedures are seldom used in clinical 
practice. Although typically given as an individual oral test in the clinical situa- 
tion, the TAT may also be administered in writing and as a group test. 

The TAT has been used extensively in personality research. Unfortunately, the 
wide diversity of administration and scoring procedures, and even of stimulus ma- 
terials, associated with the TAT rubric has extended to research uses as well as 
clinical practice (Keiser & Prather, 1990). This diversity has made it very difficult 
to investigate the psychometric properties of “the TAT” as a distinct psychologi- 
cal test because the method lacks the basic requirement of uniformity. In addi- 
tion, a considerable body of experimental data is available to show that such 
conditions as hunger, sleep deprivation, and social frustration significantly affect 
TAT responses NE W. Atkinson, 1958). While supporting the projective hypothe- 
sis, the sensitivity of the TAT to such temporary conditions may complicate the 
meaning of responses. The question of internal consistency of TAT responses has 
also received attention (J. W. Atkinson, 1981; Entwisle, 1972). And some control 
should be applied for length of stories, or productivity—a problem that the TAT 
shares with the Rorschach (J. W. Atkinson & Raynor, 1974, chap. 3). 

Gs e E web of thematic apperception techniques, in general, and 
Dä d e ular, is not in question. Recent investigations confirm the 
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duff, & Kelsey, 1995; Westen, 1991; Westen, Lohr, Silk, Gold, & Kerber 1990). 
Nor is the usefulness of the TAT confined to the thematic analysis of responses; 
formal characteristics of both the structure and content of TAT stories may also 
be fruitfully employed in the study of individuals and groups (see, e.g. Cramer, 
1996; McGrew & Teglasi, 1990; Teglasi, 1993). 


Adaptations of the TAT and Related Tests. Many adaptations of the TAT 
have been developed for special purposes. These exhibit varying degrees of 
resemblance to the original. Where to draw the line between the modified ver- 
sions of the TAT and new tests based on the same general approach as the TAT is 
arbitrary. Several versions of the TAT have been prepared for use in surveys of 
attitudes toward labor problems, minority groups, authority, and the like (D. Ir. 
Campbell, 1950; R. Harrison, 1965). Other adaptations have been developed for 
use in career counseling, executive appraisal, and a wide variety of research 
projects. Forms have also been constructed for special populations, including 
preschool children, elementary school children, disabled children, adolescents, 
and vatious national and ethnic groups (R. Harrison, 1965). 

Some TAT adaptations have focused on the intensive measurement of a single 
need or drive, such as sex or aggression. Of special interest is the extensive re- 
search on the achievement need (n-Ach) conducted over some thirty years by 
McClelland, Atkinson, and their associates (J. W. Atkinson, 1958; J. W. Atkinson 
& Feather, 1966; J. W. Atkinson & Raynor, 1974; McClelland, 1985; McClelland, 
Atkinson, Clark, & Lowell, 1953/ 1976). To measure n-Ach, four pictures were 
employed, two of which were taken from the TAT. Detailed schemes have been 
developed for scoring the resulting stories with regard to expressions of n-Ach. 
This technique has been used in an extensive program of research on achieve- 
ment motivation. The problems investigated range from basic motivation theory 
(J. W. Atkinson & Feather, 1966) to the social origins and consequences of n-Ach 
and its role in the rise'and fall of societies (McClelland, 1961/1976). Meta- 
analyses of studies comparing TAT and questionnaire measures of n-Ach suggest 
that these two methods are both valid, albeit for different purposes and different 
aspects of the assessment of achievement drive (Spangler, 1992). 

A compendium of the scoring systems used in the content analysis of verbal 
material has been prepared by Charles Smith in association with John W: Atkin- 
son, David C. McClelland, and Joseph Veroff (1992). Included are scoring sys- 
tems with research traditions of long standing Leg, those for the achievement, 
affiliation, and power motives), and many others dealing with topics as diverse as 
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Although the original TAT is said to be applicable to children as young as 4 
years of age,!° the Children’s Apperception Test (CAT) was especially designed 
for use between the ages of 3 and 10 years (Bellak, 1993). The CAT cards substi- 
tute animals for people on the assumption that young children project more read- 
ily to pictures of animals than to pictures of humans. The various animals in the 
pictures are portrayed in typically human situations, in the characteristic anthro- 
pomorphic fashion of comic strips and children's books. The pictures are designed 
to evoke fantasies relating to problems of feeding and other oral activity, sibling 
rivalry, parent-child relations, aggression, toilet training, and other childhood ex- 
periénces. The authors of the CAT prepared a human modification of the test 
(CAT-H) for use with older children, especially those with a mental age beyond 
10 years (Bellak & Hurvich, 1966). The authors maintain that either the human 
or animal form may be more effective, depending on the age and personality char- 
acteristics of the child (for reviews, see Hatt, 1985; Shaffer, 1985). 

A more recently developed test, the Roberts Apperception Test for Children 
(RATC), comes closer to meeting psychometric standards of test construction and 
evaluation than do other techniques of this type (McArthur & Roberts, 1982; see 
also Sines, 1985). The RATC provides two overlapping sets of 16 stimulus cards, 
one for boys and one for girls. A supplementary set with pictures of Black children 
is also available but has not been normed. The pictures were chosen to depict fa- 
miliar interpersonal situations involving children in their relations with adults or 
with other children (see Fig. 15-2). The stories are scored on a series of scales cov- 
ering the types of problems for which children are commonly brought to clinics. 
Clear and explicit guidelines permit fairly objective scoring of responses; norms are 
based on the responses of 200 teacher-nominated well-adjusted children. Compar- 
isons of these responses with those of 200 children seen at child guidance clinics 
provided some of the validation data cited in the manual. Clearly, this instrument 
represents a serious effort to combine the flexibility of projective techniques with 
the administration, scoring, and test evaluation procedures of a standardized test. 
Investigations of the validity of the RATC for various uses continue to show favor- 
able results (see, e.g., Palomares, Crowley, Worchel, Olson, & Rae, 1991). In addi- 
tion, a handbook with detailed guidelines for scoring and interpreting the RATC 
in clinical use has been prepared by Glen E. Roberts (1994). 

TEMAS is the Spanish word for “themes” and the clever acronym for “Tell- 
Me-A-Story,” an instrument specifically designed for the assessment of cognitive, 
affective, and personality characteristics of children from 5 to 18 years of age 
(Costantino, Malgady, & Rogler, 1988). TEMAS uses two parallel sets of stimulus 
cards in full color, one for ethnic minority children and one for White children. 
The stimulus materials were carefully developed to facilitate verbal production 
and stimulate stories dealing with choices among conflicting goals, such as imme- 
diate versus delayed gratification. The minority set depicts characters whose 
features and dark complexions suggest a Black or Hispanic origin. Although 


‘For specific information regarding the clinical use of the TAT and other storytelling techniques with children 
and adolescents, see Teglasi (1993) and Worchel and Dupree (1990). 
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Figure 15-2. Oneofthe Pictures Used in the Roberts Apperception Test for 
Children. 
(Copyright ©1982 by Western Psychological Services. Reproduced by permission.) 


TEMAS has been praised as a distinct improvement over the original TAT cards 
in terms of its suitability for African American and Hispanic American children, 
the psychometric properties of this instrument, especially its test-retest reliability 
and internal consistency, have been repeatedly called into question (for reviews, 
see Dana, 1993, chap. 8; Lang, 1992; Ritzler, 19933). 

Similar thematic apperception tests have been developed for the aged, includ- 
ing the Gerontological Apperception Test (Wolk & Wolk, 1971) and the Senior 
Apperception Test (Bellak, 1993; Bellak & Bellak, 1973). Both employ sets of 
r more elderly persons and illustrating problems that may be 
of concern to the aged, such as loneliness, family difficulties, and helplessness. 
Both instruments have been criticized because of premature publication and the 
use of pictures that tend to perpetuate adverse stereotypes of aging (J. P. Schaie, 
1978; K. W. Schaie, 1978). Moreover, neither of these instruments has been 
shown to have advantages Over the TAT in testing elderly persons (Fitzgerald, 
Pasewark, & Fleisher, 1974; Foote & Kahn, 1979); and the Gerontological Ap- 


perception Test is no longer in print. 


cards featuring one o 


The Rosenzweig Picture-Frustration Study. The TAT and related 
techniques we have just considered use pictures to stimulate the free play of 
fantasy and to evoke elaborate verbal responses. In contrast, the Rosenzweig 
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Picture-Frustration Study (P-F Study), described in this section, is more circum- 
scribed in coverage and calls for simpler responses. This instrument is available in 
separate forms for adults, aged 14 and over (Rosenzweig, 1950, 1978a, 1978d); for 
adolescents, aged 12 to 18 (Rosenzweig, 1970, 1976b, 1981a); and for children, 
aged 4 to 13 (Rosenzweig, 1960, 1977, 1981b, 1988). Derived from the author's 
theory of frustration and aggression, the P-F Study presents a series of cartoons in 
which one person frustrates another or calls attention to a frustrating condition. 
Two of these cartoons, taken from the children’s form, are shown in Figure 15-3. 
In the blank space provided, the respondent writes what the frustrated person 
would reply. 

Responses on the P-F Study are classified with reference to type and direction of 
aggression. Types of aggression include: obstacle-dominance, emphasizing the frus- 
trating object; ego-defense, focusing attention on the protection of the frustrated 
person; and need-persistence, concentrating on the constructive solution of the 
frustrating problem. Direction of aggression is scored as: extraggressive, or turned 
outward onto the environment; intraggressive, or turned inward upon oneself; and 
imaggressive, or turned off in an attempt to gloss over or evade the situation. In 
scoring the test, the percentage of responses falling into each of these categories is 
compared with the corresponding normative percentages. A group conformity rat- 
ing (GCR), showing the individual's tendency to give responses that agree with 
the modal responses of the standardization sample, may also be obtained. 

Being more limited in coverage, more highly structured, and relatively objec- 
tive in its scoring procedures, the P-F Study lends itself better to statistical analy- 


I’m not going Be quiet! 


to ask you to Mother wants 


my birthday to sleep. 


Figure 15-3. Typical Items from the Rosenzweig Picture-Frustration Study, 
Children’s Form. 


(Copyright © 1976 by Saul Rosenzweig. Reproduced by permission.) 
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sis than do most other projective techniques. Systematic efforts have been made 
from the outset to gather norms and to check its reliability and validity. Over 
some fifty years, considerable research has been conducted with the P-F Study by 
both the test author and other investigators. This research literature deals with 
the psychometric properties of the instrument and with such topics as clinical di- 
agnosis, developmental changes, sex differences, cultural differences, and the re- 
lationship between humor and aggression (Graybill, 1990, 1993; Nevo & Nevo, 
1983; Rosenzweig, 19762; 1978b, 1978c; Rosenzweig & Adelman, 1977—for re- 
views, see Viglione, 1985; Wagner, 1985). 


VERBAL TECHNIQUES 


Although allyprojective instruments discussed thus far require verbal responses, 
certain projective techniques are wholly verbal, utilizing only words in both stim- 
ulus materials and responses. Some of these verbal techniques can be adminis- 
tered in either oral or written form, but all are suitable for written group 
administration. When so administeted, of course, they presuppose a minimum 
reading level and thorough familiarity with the language in which the test was de- 
veloped. These requirements thus preclude the use of such techniques with young 
children or with persons who are illiterate or who do not speak English. 

A technique that antedated the flood of projective tests by more than half a 
century is the word association test. Originally known as the "free association test, 
this technique was first systematically described by Galton (1879). Wundt and J. 
McK. Cattell subsequently introduced it into the psychological laboratory, where 
it was adapted to many uses- The procedure involves simply the presentation ofa 
series of disconnected words, to each of which the individual is told to respond by 


ration of thinking processes.. 1 

The clinical application of word association. methods was stimulated largely by 
the psychoanalytic movement, although other psychiatrists, such as Kraepelin, 
had previously investigated such techniques. Among the psychoanalysts, Jung’s 
contribution to the systematic development of the word association test is most 
conspicuous. Jung (1910) selected stimulus words to represent common emo- 
tional complexes” and analyzed the responses with reference to reaction time, 
content, and physical expressions of emotional tension. Over thirty years later, a 
similar word association technique was developed at the Menninger Clinic by 
Rappaport and his associates (1946/1968). ‘According to its authors, the test had 
a dual aim: to aid in detecting impairment of thought processes and to suggest sig- 
nificant areas of conflict. Mention may also be made of the use of the word associ- 
ation technique as a “lie detector.”!! This application was likewise initiated by 


4 
the word association technique is no longer used in this manner. Fora discussion of more recent applications 
of the "lie detector" or polygraph method in industrial/organizational settings, see chapter 17. 
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Jung and was subsequently subjected to extensive research—both in the labora- 
tory and in practical situations (Burtt, 1931; Lindsley, 1955). The rationale of- 
fered to justify the employment of word association in the detection of lying or 
guilt was similar to that applied in using it to uncover areas of emotional conflict. 

A different approach to the word association test is illustrated by the early 
work of Kent and Rosanoff (1910). Designed principally as a psychiatric screen- 
ing instrument, the Kent-Rosanoff Free Association Test utilized completely ob- 
jective scoring. The stimulus words consisted of 100 common, neutral. words, 
chosen because they tend to evoke the same associations from people in general. 
For example, to the word table, most people respond “chair”; to dark, they say 
“light.” A set of frequency tables was prepared—one for each stimulus word— 
showing the number of times each response was given in a sample of 1,000 nor- 
mal adults. In scoring the test, an "index of commonality" was derived from the 
frequency values of the responses of each examinee, Comparisons of psychotic in- 
dividuals with normals suggested that psychotics obtain a lower index of com- 
monality than normals. 

The diagnostic use of word association techniques declined, however, with the 
gradual realization that response frequency also varies widely with age, socioeco- 
nomic and educational level, regional and cultural background, creativity, and 
other factors. Hence, proper interpretation of results requires gathering norms on 
many subgroups and updating them periodically, as word usage evolves. In addi- 
tion, the popularity of the traditional psychoanalytic concepts that stimulated the 
development of these techniques has also diminished (Rabin & Zlotogorski, 
1981). The Kent-Rosanoff test has nevertheless retained its position as a standard 
laboratory tool. Additional norms have been gathered in several countries, and 
the technique has been extensively employed in research on verbal behavior and 
personality (Goldfarb & Halpern, 1984; Isaacs & Chen, 1990; Jenkins.& Russell, 
1960; Palermo & Jenkins, 1963; Postman & Keppel, 1970; Van der Made-Van 
Bekkum, 1971). 

Another verbal projective technique, sentence completion, has been widely em- 
ployed in both research and clinical practice (P. A. Goldberg, 1965; Haak, 1990; 
D. H. Hart, 1986; Lah, 1989). In terms of length of responses, amount of struc- 
ture, and other respects, sentence completion tests occupy the middle ground be- 
tween word association and thematic techniques. Generally, the opening words, 
or sentence stems, permit an almost unlimited variety of possible completions. 
Examples might be: My ambition . ...; Women . . .; What worries me. . .; My 
mother. . . . The sentence stems are frequently formulated so as to elicit responses 
relevant to the personality domain under investigation. This flexibility of the sen- 
tence completion technique represents one of its advantages for clinical and re- 
search purposes. Nevertheless, some standardized forms have been published for 
more general application. 

A widely used example is the Rotter Incomplete Sentences Blank (RISB— 
Rotter & Rafferty, 1950), consisting of 40 sentence stems. The directions to the 
test taker read: “Complete these sentences to express your real feelings. Try to do 
every one. Be sure to make a complete sentence.” Each completion is rated on a 
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7-point scale according to the degree of adjustment or maladjustment indicated. 
Illustrative completions corresponding to each rating are given in the manual. 
With the aid of these specimen responses, fairly objective scoring is possible. The 
sum of the individual ratings provides a total adjustment score that can be used 
for screening purposes. The response content can also be examined clinically for 
mote specific diagnostic clues. The newly revised manual of the RISB includes 
updated normative information and a review of research studies conducted since 
1950 (Rotter, Lah, & Rafferty, 1992). 

Many other sentence completion tests have been developed for the assessment 
of different target populations and for a variety of research and psychodiagnostic 
uses? (for descriptions of several traditional instruments, see D. H. Hart, 1986; 
Lah, 1989; Rabin & Zlotogorski, 1981). Some interesting recent additions to this 
field include instruments designed to detect malingering during disability exami- 
nations, to predict managerial effectiveness, and to assess constructs, such as de- 
fense mechanisms, that may be relevant in the assessment of personality (Carson 
& Gilliard, 1993; N. L. Johnson & Gold, 1995; Timmons, Lanyon, Almer, & 
Curran, 1993). 


AUTOBIOGRAPHICAL MEMORIES 


One of the most recent and promising developments in the area of projective ver- 
bal techniques is the resurgence of interest in the use of autobiographical memories 
for personality assessment. Analyzing memories, especially those of early life, in 
order to understand recurrent or intractable conflicts in later life has, of course, 
been a staple in psychodynamic psychotherapy since the time of Freud.” In addi- 
tion, Alfred Adler, one of Freud's original followers who went on to found his own 
school of individual psychology, considered that the very first memory, specifi- 
cally, holds the key to understanding the individual's "style of life." As a result, 
Adlerian psychologists have used early memories as clinical tools, and occasion- 
ally in research, since the 1930s (see, e.g., Hafner, Fakouri, & Labrentz, 1982; 
Slavik, 1991). Other theorists have also acknowledged the central role that auto- 
biographical memories—usually viewed as constructions or projections rather 
than as true historical accounts—can play in the evolution of personality. By and 
large, however, after a flurry of interest in the early part of the twentieth century, 
this seemingly vital source of information concerning personality has not been 
consistently or systematically explored until recently. 

Since the early 1980s, as a result of the influence of the cognitive point of view 
in psychology, there has been a renewal of interest in autobiographical memory, 
in general, and in its special function in personality organization, in particular 


land Ee GER 
12An instrument that has been widely used in research is the Washington University Sentence Completion 
Test. Designed especially for assessing stages in the development of self-conceptualization, it is discussed in 
chapter 16 together with other measures of self-concept. 

13See Bruhn (1995a) fora brief history of the use of autobiographical memories in personality assessment. 
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(Bruhn & Last, 1982; Ross, 1991; Rubin, 1986; Singer & Salovey, 1993). The 
work of Arnold R. Bruhn has been outstanding in this area (Bruhn, 1984, 1985, 
1990a, 1990b). After reviewing the models previously applied be Freudians, 
Adlerians, and ego psychologists to the interpretation of early memories, he has 
proposed a new conceptual framework for them, as well as a more systematic way 
of employing them. In Bruhn’s cognitive-perceptual theory, autobiographical 
memories (EMs) are central to the understanding of personality. Thus, the devel- 
opment of a standardized method for gathering and interpreting them has been 
one of Bruhn’s priorities. The Early Memories Procedure (EMP—Bruhn, 1989, 
1992a, 1992b) is a self-administered paper-and-pencil instrument that samples 21 
autobiographical memories from the entire lifespan, not just childhood. The first 
part calls for six general or “spontaneous” memories, delimited primarily by spe- 
cific timeframes (i.e., the five earliest memories and a particularly important life- 
time memory); the second part comprises 15 specific or “directed” memories that 
explore a diverse set of events and areas that may be clinically relevant (e.g., a 
traumatic memory, one’s first punishment memory, or one’s happiest memory).!4 
In addition to narrative descriptions of each memory, the EMP includes various 
probes regarding the clarity, affective tone, significance, and various other ele- 
ments of the memories. Bruhn considers EMs of specific events to be stories—or 
metaphors—that reflect what peorle have consciously learned or intuited from 
their life experiences. He also believes that these stories are often inaccurate or 
distorted but maintains that their veracity is immaterial for clinical purposes be- 
cause, as is the case for other projective productions, the value of EMs lies in their 
power to reveal current concerns, attitudes, beliefs, and affective states. 

Although Bruhn and his associates have devised, and revised, a Comprehen- 
sive Early Memories Scoring System (CEMSS-R—Last & Bruhn, 1991), Bruhn’s 
approach to the scoring and interpretation of autobiographical memories is quite 
flexible.'? He views EMs as complex psychological phenomena whose explana- 
tion may require different theoretical models and, consequently, different scoring 
systems, In fact, he has advocated the development of customized scoring systems, 
or “boutique” systems, based on empirically observed aspects of the EMs of crite- 
tion groups, for the purposes of making specific predictions. Bruhn and his collab- 
orators have obtained promising data with scoring systems designed to predict 
proneness to delinquency and to violence (Davidow & Bruhn, 1990; Tobey & 
Bruhn, 1992). 

The EMP is clearly a technique still in the process of development. No norma- 
tive work has yet been undertaken. Although adequate levels of interjudge agree- 
ment have been obtained with several of the scoring categories that Bruhn. and 


The EMP directions also solicit memories of “an inappropriate sexual experience” and of “being physically or 
emotionally abused”; however, respondents may check boxes indicating they have had no such experiences. 
This option, as well as the paper-and-pencil format of the EMP, reduces the possibility of obtaining the kind of 
"suggested" memories of abuse that have concerned both mental health professionals and the general public 
(See, e.g, Loftuss199A), 


!5One of the interpretive techniques that Bruhn strongly recommends is the use of a precis—or concise sum- 
mary—of the EM, a procedure often used with the TAT and related instruments. 
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his associates have devised, there is not much empirical evidence of other types of 
reliability for EMs. Undoubtedly, obtaining these and other psychometric data on. ` 
the EMP may be problematic. As is the case with other projective material, the 
very act of categorizing and quantifying autobiographical memories entails a con- 
sequent loss of information that may be uniquely valuable and appropriate for un- 
derstanding the person in question. Nevertheless, the procedure has the potential 
to become a very useful tool for personality assessment, especially within the con- 
text of psychotherapy (see, e.g., Ritzler, 1993b). Furthermore, systematic samples 
of autobiographical memories are more likely to be clinically significant than 
other kinds of verbal material—such as dream reports, free speech samples, or sto- 
ries—that have been used for similar purposes and in similar ways. 


PERFORMANCE TECHNIQUES 


A large and amorphous category of projective techniques comprises many forms 
of relatively free self-expression. It is characteristic of all these techniques that 
they have been employed as therapeutic as well as diagnostic procedures. Through 
the opportunities for self-expression that these activities afford, it is believed, the 
individual not only reveals her or his emotional difficulties but also relieves them. 
The methods most frequently employed in this category are drawing and play 
techniques of various types, including the dramatic use of toys. Not surprisingly, 
the majority of these methods were specifically designed for the assessment of 
children, though in many cases they may also be used with adults. 


Drawing Techniques. Although almost every art medium, technique, and 
type of subject matter has been investigated in the search for significant diagnos- 
tic clues in the evaluation. of personality, special attention has centered on 
drawings of the human figure." A well-known early example is the Machover 
Draw-a-Person Test (D-A-P—Machover, 1949). In. this test, the individual is 
provided with paper and pencil and is told to “draw a person.” Upon completion 
of the first drawing, he or she is asked to draw a person of the opposite sex—or of 
a different gender—from that of the first figure. While the respondent draws, the 
examiner notes her or his comments, the sequence in which differents parts are 
drawn, and other procedural details. The drawing is usually followed by a series of 
questions to elicit specific information about the age, schooling, occupation, and 
other facts associated with the characters portrayed. The inquiry may include a 
request for the respondent to make up a story about each person drawn. 

The interpretation of the D-A-P as proposed by Machover is essentially quali- 
tative and abounds in sweeping generalizations based on single indicators, such as 
“Disproportionately large heads will often be given by individuals suffering from 


16For some intriguing anecdotal examples of the clinical value of early memories, including a comparison of EMs 
taken from the autobiographical writings of Nixon and Freud, see Bruhn (1995b). 
"The use of human figure drawings as a nonverbal measure of cognitive functioning is discussed in chapter 9. 
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organic brain disease." Although reference is made to “thousands of drawings" ex- 
amined in clinical contexts, and a few selected cases are cited for illustrative pur- 
poses, no systematic presentation of data accompanies the original published 
report of the test. In addition, subsequent validation studies by other investigators 
generally have failed to lend support to Machover's diagnostic interpretations 
(see, e.g., Klopfer & Taulbee, 1976, pp. 558—561). 


Another method for the utilization of human figure drawings (HFDs) of children. 


and young adolescents—one that is grounded on a more solid empirical founda- 
tion— has been devised by Koppitz (1968, 1984). Asa result of her abiding belief in 
the clinical usefulness of HFDs in the assessment of children, Koppitz developed 
and standardized two objective scoring systems using drawings produced by 1,856 
public school children between the ages of 5 and 12. One of the systems, based pri- 
marily on the Goodenough-Harris Drawing Test (see chap. 9) and on Koppitz’s own 
clinical experience, uses HFDs as a developmental test of mental maturity. The sec- 
ond, derived from the work of Machover and others, is a projective test of children's 
interpersonal attitudes and concerns. It consists of 30 “emotional indicators" that 
differentiated between the drawings of children with and without emotional prob- 
lems. These indicators rarely occurred among the normal children in the sample 
and, unlike the developmental set, are presumably unrelated to age and maturation 
level. They include: (a) quality signs, such as transparencies and shading of the face; 
(b) special features, such as tiny heads or grotesque figures; and (c) omissions of some 
expected items, such as the neck or the eyes. 

Some of the aggregate features of HFDs, such as bizarreness or total number of 
“emotional indicators,” appear to discriminate between children who show prob- 
lems and those who are well adjusted (D. T. Marsh, Linberg, & Smeltzer, 1991; 
Naglieri & Pfeiffer, 1992; Yama, 1990). However, both Koppitz and other investi- 
gators have warned against the use of single indicators or “signs” for diagnostic 
purposes. The consensus regarding HFDs seems to be that they can provide only a 
very general idea of the level of emotional adjustment of children. Furthermore, 
as far as diagnostic applications are concerned, most experts agree that drawings 
should be used only to generate hypotheses and that they must be interpreted in 
the context of other information about the individual (M. V. Cox, 1993; Knoff, 
1993; Tharinger & Stark, 1990). 

In spite of these cautions and restrictions, not only has the popularity of human 
figure drawings continued unabated but, in fact, several additional drawing tasks 
have been devised. One of the most widely used is the House-Tree-Person (H-T- 
P) technique which, as its name implies, requires the respondent to complete sep- 
arate drawings of a house, a tree, and a person (Buck, 1948, 1992). The 
characteristics and features of the drawings themselves, along with a fairly exten- 
sive inquiry following the drawing tasks, are typically used as a source of hypothe- 
ses about general areas of conflict and concern. A newer technique that seems to 
have unusual potential as a clinical instrument is the Kinetic Family Drawing 
(KFD—R. C. Burns, 1982; R. C. Burns & Kaufman, 1970, 1972). This test asks 


children to draw a picture of everyone in their family, including themselves, 
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“doing something.” The KFD has generated a very large number of investigations; 
in a recent review of the literature, Handler and Habenicht (1994) state that in 
spite of methodological problems in the research done on this technique, there 
are a number of promising findings that warrant further pursuit with more sophis- 
ticated analyses, such as multiple regression. The distance and degree of interac- 
tion between the figures in the KFD, for example, appear to be among the most 
psychologically meaningful features of the drawings. 

Ever more imaginative tasks continue to be devised. For example, the collabo- 
rative drawing technique literally requires a whole family, or a couple, to com- 
plete a single drawing cooperatively while in the presence of one or more 
therapists who observe closely the behavior of all participants (G. Smith, 1991). 
This interactive technique took its inspiration from the KFD and is used primarily 
in the context of family therapy. 


materials have subsequently been adapted for the diagnostic testing of both adults 
and children. The objects are usually selected because of their associative value. 
Among the articles most frequently employed for these purposes, for example, are 
dolls representing adults and children of both sexes, animals, furniture, bathroom 
and kitchen figures, and other household furnishings. The Scenotest, pictured in 
Figure 15-4, consists of a standardized set of equipment of this type and an 
accompanying handbook. It was published in Switzer d in the 1960s and has 
recently become available in the United States (Staabs, 1991). Play with such 
articles is expected to reveal the child’s attitude toward her or his own family, as 
well as sibling rivalries, fears, aggressiveness, conflicts, and the like. The examiner 
notes what items the child chooses and what he or she does with them, as well as 
the child’s verbalizations, emotional expressions, and other overt behavior. 

With children, the examiner simply makes available the collection of toys for 
free play. With adults, the materials are presented with general instructions to 
carry out some. task of a highly unstructured nature. These instructions may, of 
course, also be employed with children. Frequently, the task has dramatic features, 
as in the arrangement of figures on a miniature stage set. The materials for the 
Scenotest, for example, come in a flat, portable case that has a lid which can actu- 
ally be used as a "stage" with the various figures and accessories (see Fig. 15-4). 

Play techniques for the diagnosis and assessment of children have been cata- 
logued in a comprehensive volume edited by Schaefer, Gitlin, and Sandgrund 
(1991). In addition to projective tools, such as puppet techniques, this work 


i) See 

: jecti i i ioned in this 
18Further discussion of the use of projective drawings, and reviews of several of the techniques mentione: 

section, can be found in Cummings (1986), Hammer (1986), Handler (1996), and Knoff (1990). ; 
rpretation of projective play, from a viewpoint that emphasizes 
Krall (1986). 


19For a general discussion of procedures and interpr 
psychoanalytic theory and developmental perspectives, see 
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Figure 15-4. Standardized Materials from the Scenotest. 


(Courtesy of Hogrefe & Huber Publishers. Copyright © 1994 by Hogrefe & Huber Publishers. Reproduced by 
permission.) f 


describes a wide assortment of play scales for the assessment of specific prob- 
lems—ranging from autism to hyperactivity—and for the developmental evalua- 
tion of such areas as competence, mastery motivation, and infant temperament. 
Also included are scales for use in play therapy and for assessing parent-child and 
peer interaction. As Schaefer and his collaborators acknowledge, many of the 
techniques they present are still in the initial stages of development. Neverthe- 
less, the diverse collection of approaches they have gathered includes some inge- 
nious measures and offers both highly structured formal observation methods that 
might be best suited for research as well as many clinically oriented instruments. 


EVALUATION OF PROJECTIVE TECHNIQUES 


It is evident that projective techniques differ widely among themselves. Some ap- 
pear more promising than others because of more favorable empirical findings, 
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sounder theoretical orientation, or both. Regarding some techniques, such as the 
Rorschach, voluminous data have been gathered, although their interpretation is 
still often uncertain. About others, little is known, either because of their recent 
origin, or because objective verification is hindered by the intrinsic nature of the 
instruments or by the attitudes of their exponents. 

A related observation is that the differences between projective techniques 
and standardized tests are not as large or as fundamental as may appear at first 
sight. Not only in their psychometric properties, but also in the nature of the task 
presented to test takers and the ways in which the results are interpreted, it has 
been argued convincingly that projective techniques and self-report inventories 
differ in degree rather than in kind (Levy, 1963). Individual instruments fall into 
a continuum; although at the extremes the differences are easily recognizable, 
overlapping in several features is evident at the center. 

To evaluate each projective instrument individually and to attempt to summarize 
the extensive pertinent literature would require a separate volume. Within this 
chapter, critical comments have been interjected only in the cases of instruments 
that presented unique features—whether of a favorable or unfavorable nature. Sev- 
eral general questions, however, can be raised to some extent about most projective 
techniques. These questions can be conveniently considered in summary form. 


Rapport and Applicability. Most projective techniques represent an effec- 
tive means for “breaking the ice” during the initial contacts between clinician 
and client. The task is usually intrinsically interesting and often entertaining. It 
tends to divert the individual’s attention away from herself or himself and thus 
reduces embarassment and defensiveness- And it offers little or no threat to the 
respondent's prestige, since any response one gives is “right.” 

Certain projective techniques may be especially useful with young children, il- 
literates, and persons with language difficulties or speech defects. Nonverbal 
media would be readily applicable to all these groups. And oral responses to picto- 
rial and other nonlanguage stimuli could be secured from the first two. With all 
these verbally limited groups, projective techniques may help the test taker to 
communicate with the examiner. These techniques may also aid individuals in 
clarifying for themselves some aspects of their own behavior that they had not 


previously verbalized. 


Faking. In general, projective instruments are less susceptible to faking than 
are self-report inventories. The purpose of projective techniques is usually 
disguised.?° Even if an individual has some psychological sophistication and is 
familiar with the general nature of a particular instrument, such as the Rorschach 
or TAT, it is still unlikely that he or she can predict the intricate ways in which 
the responses will be scored and interpreted. As the results of a recent series of 
studies comparing an objective and a projective measure of dependency suggest, 


20To the extent that the examinee is unaware of the nature and purpose of the assessment within which these 
techniques are used, their disguised nature presents an ethical problem (see chap. 18). 
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there seems to be an inverse relationship between the face validity of a test and its 
susceptibility to faking (Bornstein, Rossner, Hill, & Stepanian, 1994). Moreover, 
the respondent soon becomes absorbed in the task and hence is less likely to 
resort to the customary disguises and restraints of interpersonal communication. 

On the other hand, it cannot be assumed that projective tests are completely 
immune to faking. Several experiments with the Rorschach, TAT, and other pro- 
jective instruments have shown that significant differences do occur when re- 
spondents are instructed to alter their responses so as to create favorable or 
unfavorable impressions, or when they are given statements suggesting that cer- 
tain types of responses are more desirable (Masling, 1960). There is considerable 
experimental evidence that responses to projective tests can in fact be success- 
fully altered in both a “fake good" and a "fake bad” direction, although the latter 
may be easier to accomplish, Such results have been obtained with several projec- 
tive instruments, including the Rorschach, TAT, Rosenzweig P-F Study, and sen- 
tence completion tests (Albert, Fox, & Kahn, 1980; Exner, 1991; Kaplan & Eron, 
1965; Meltzoff, 1951; Netter & Viglione, 1994; Perry & Kinder, 1990; Schwartz, 
Cohen, & Pavlik, 1964). The skilled examiner is alert to signs of faking, both in 
the nature of individual responses and response patterns and in inconsistencies 
with other sources of data about the respondent. 


Examiner and Situational Variables. It is obvious that most projective 
techniques are inadequately standardized with respect to both administration and 
scoring, or are not used in a standardized fashion in clinical practice. Yet there is 
evidence that even subtle differences in the phrasing of verbal instructions and in 
examiner-examinee relationships can appreciably alter performance on these 
tests (Baughman, 1951; Exner, 1993; Hamilton & Robertson, 1966; Herron, 
1964; Klinger, 1966; Klopfer & Taulbee, 1976). Even when employing identical 
instructions, some examiners may be more encouraging or reassuring, others more 
threatening, owing to their general manner and appearance. Such differences 
may affect response productivity, defensiveness, stereotypy, imaginativeness, and 
other basic performance characteristics. In the light of these findings, problems of 
administration and testing conditions assume even greater importance than in 
other psychological tests. 

Equally serious is the lack of objectivity in scoring and interpretation. Even 
when objective scoring systems have been developed and followed, the final steps 
in the evaluation and integration of the raw data usually depend on the skill and 
clinical experience of the examiner. The most disturbing implication of this situ- 
ation is that the interpretation of scores is often as projective for the examiner as 
the test stimuli are for the examinee. In other words, the final interpretation of 
projective test responses may reveal more about the theoretical orientation, fa- 
vorite hypotheses, and personal idiosyncrasies of the examiner than it does about 
the examinee’s personality dynamics. 


Norms. Another conspicuous deficiency common to many projective 
instruments pertains to normative data. Such data may be completely lacking, 


Projective Techniques 449 


grossly inadequate, or based on vaguely described populations. In the absence of 
adequate objective norms, the clinician falls back on her or his “general clinical 
experience” to interpret projective test performance. But such a frame of 
reference is subject to all the distortions of memory that are themselves 
reflections of theoretical bias, preconceptions, and other idiosyncrasies of the 
clinician. Moreover, any one clinician’s contacts may have been limited largely to 
persons who are atypical in education, socioeconomic level, sex ratio, age 
distribution, or other relevant characteristics. In at least one respect the 
clinician’s experience is almost certain to produce a misleading picture, since he 
or she deals predominantly with disturbed or pathological individuals. The 
clinician may thus lack enough first-hand familiarity with the characteristic 
reactions of normal people on the tests. The Rorschach norms gathered by Exner 
represent efforts to correct some of the more obvious lacks in this regard. 

Interpretation of projective test performance often involves subgroup norms, of 
either a subjective or an objective nature. Such norms may lead to faulty interpre- 
tations unless the subgroups were equated in other respects. For example, if the 
schizophrenics and normals on whom the norms were derived differed also in ed- 
ucational level, the observed disparities between schizophrenic and normal per- 
formance may have resulted from educational inequality rather than from 
schizophrenia. Systematic or constant errors may also operate in the comparison 
of various psychiatric syndromes. For instance, there is some evidence that clini- 
cians have a tendency to overdiagnose schizophrenia, instead of bipolar disorder, 
in some ethnic groups and in younger individuals; similarly, conversion disorder 
has been reported to be more common in individuals of lower socioeconomic sta- 
tus (American Psychiatric Association, 1994). 


Reliability.2!. In view of the special nature of scoring procedures and the 
inadequacies of normative data in projective testing, scorer reliability becomes an 
important consideration"? For projective techniques, a proper measure of scorer 
reliability should include not only the more objective, preliminary scoring but 
also the final integrative and interpretive stages. It is not enough, for example, to 
demonstrate that examiners who have mastered the same system of Rorschach 
scoring agree closely in their tallying of such characteristics as whole, unusual 
detail, or color responses. On an instrument like the Rorschach, these. raw 
quantitative measures cannot be interpreted directly from a table of norms, asin 
the usual type of psychological test. Interpretive scorer reliability is concerned 
with the extent to which different examiners attribute the same. personality 
characteristics to the respondent on the basis of their interpretations of the 
identical record. Few adequate studies of this type of scorer reliability have been 


21For an informative discussion of the reliability issues pertaining to thematic apperception measures, with spe- 
cial reference ro the assessment of motive variables, see C. P. Smith (1992). 

2?In recognition of the importance of ensuring an adequate level of objectivity in scoring practices, the Joumal of 
Personality Assessment, since 1991, has required authors of articles that report Rorschach research to submit evi- 
dence of at least 80% agreement between two or more scorers for all major scoring categories. 
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conducted with projective tests. Some investigators have revealed marked diver- 
gencies in the interpretations given by reasonably well-qualified test users. A 
fundamental ambiguity in such results stems from the unknown contribution of 
the interpreter’s skill. Neither high nor low scorer reliability can be directly 
generalized to other scorers differing appreciably from those used in the particular 
investigation. In fact, one of the major reasons for the widespread popularity of 
computer-based interpretation systems for tests like the Rorschach is the unifor- 
mity of their results at the interpretive level. 

Attempts to measure other types of test reliability in the field of projective 
testing have fared even less well. Coefficients of internal consistency, when com- 
puted, have usually been low. In such tests as the Rorschach, TAT, and Rosen- 
zweig P-F Study, it has been argued that different cards or items are not 
comparable and hence should not be used in finding split-half reliabilities. In 
fact, individual items in such instruments were designed to measure different 
variables. Moreover, the trend of responses over successive items is often consid- 
ered significant in interpretation. J. W. Atkinson (1981; J. W. Atkinson & Birch, 
1978, pp. 370-374) has demonstrated through computer simulation that, with a 
TAT-like procedure, it is possible to obtain high construct validity of total scores 
(e.g., .90) when internal consistency is very low (e.g., .07). He observes that the 
individual does not respond to each successive card independently but responds 
through a continuous stream of activity that reflects the rise and fall in relative 
strength of different behavior tendencies. The expression of a tendency in activ- 
ity reduces its strength. The proportion of time the respondent spends describ- 
ing, for example, achievement-motivated activities in response to different cards 
is a function of the cumulative effect of responding to successive cards and the 
differential incentives for achievement and other competing motives in the indi- 
vidual cards. In view of the various arguments against the applicability of inter- 
nal consistency measures of reliability to projective tests, one solution is to 
construct parallel forms that are, in fact, comparable as was done in the Holtzman 
Inkblot Technique. 

Retest reliability also presents special problems. With long intervals, genuine 
personality changes may occur that the test should detect. With short intervals, a 
retest may show no more than recall of original responses. When investigators in- 
structed respondents to write different TAT stories on a retest, in order to deter- 
mine whether the same themes would recur, most of the scored variables yielded 
insignificant retest correlations (Lindzey & Herman, 1955). It is also relevant to 
note that many scores derived from projective techniques are based on very inad- 
equate response samples. In the case of the Rorschach, for instance, the number 
of responses within a given individual’s protocol that fall into such categories as 
animal movement, human movement, shading, color, unusual detail, and the like 
may be so few as to yield extremely unreliable indices. Large chance variations are 
expected under such circumstances. Ratios and percentages computed with such 
unreliable measures are even more unstable than the individual measures them- 


selves (Cronbach, 1949, pp. 411-412). 
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Validity. For any test, the most fundamental question is that of validity. Many 
validation studies of projective tests have been concerned with concurrent 
criterion-related validation. Most of these have compared the performance of 
contrasted groups, such as occupational or diagnostic groups. As was pointed out in 
connection with norms, however, these groups often differ in other respects, such as 
age or education. Other investigations of concurrent validity have used essentially a 
matching technique, in which personality descriptions derived from test records are 
compared with descriptions or data about the same persons taken from case 
histories, psychiatric interviews, or long-range behavioral records. A few studies 
have investigated predictive validity against such criteria as success in specialized 
training, job performance, or response to psychotherapy. There has been an 
increasing trend to investigate the construct validity of projective instruments by 
testing specific hypotheses that underlie the use and interpretation of each test. 

The large majority of published validation studies on projective techniques are 
inconclusive because of procedural deficiencies in either experimental controls or 
statistical analysis, or both. Some methodological deficiencies may have the effect 
of producing spurious evidence of validity where none exists. An example is the 
contamination of either criterion or test data. Thus, the criterion judges may 
have had some knowledge of the respondent's test performance. Similarly, the ex- 
aminer may have obtained cues about the respondent's characteristics from con- 
versation with her or him in the course of test administration, or from case history 
material and other nontest sources. The customary control for the latter type of 
contamination in validation studies is to utilize blind analysis, in which the test 
record is interpreted by a scorer who has had no contact with the respondent and 
who has no information about her or him other than that contained in the test 
protocol. Clinicians have argued, however, that blind analysis is an unnatural way 
to interpret projective test responses and does not correspond to the way these in- 
struments are used in clinical practice. 

Another common source of spurious validity data is failure to cross-validate 
(Kinslinger, 1966). Because of the large number of potential diagnostic signs or 
scorable elements that can be derived from most projective tests, it is very easy by 
chance alone to find a set of such signs that differentiate significantly between cri- 
terion groups. The validity of such a scoring key, however, will collapse to zero 
when applied to new samples. 

A more subtle form of error is illustrated by stereotype accuracy. Certain descrip- 
tive statements, such as might occur in a Rorschach protocol, may apply widely to 
persons in general, or to young men, or to hospitalized patients, or to whatever cat- 
egory of persons is sampled by the particular investigation. ? Agreement between 
criterion and test data with regard to such statements would therefore yield a spuri- 
ous impression of validity. Some check on this error is needed, such as a measure of 


“Barnum effect" cited in chapter 17 


RE oos 
23The use of such generally applicable statements is an example of the 
T ive research on this effect, see Klopfer 


(Dunnette, 1957; Mech), 1956). Fora well-balanced survey on the extensi 
(1983, pp. 510-514). 
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the agreement between the test evaluation of one respondent and the criterion 
evaluation of another respondent in the same category. This measure would indi- 
cate the amount of spurious agreement resulting from stereotype accuracy under the 
conditions of the particular investigation (see, e.g., Silverman, 1959). 

Still another common source of error, arising from reliance on clinical experi- 
ence in the validation of diagnostic signs, is what Chapman (1967) labeled "illu- 
sory validation." This phenomenon may account in part for the continued use of 
instruments and systems of diagnostic signs for which empirical validity findings 
are predominantly negative. In a classic series of experiments to test this phenom- 
enon, Chapman and Chapman (1967) presented college students with a set of 
human figure drawings similar to those obtained in the Machover Draw-a-Person 
Test (D-A-P). The results showed that participants responded in terms of 
preestablished popular stereotypes even though these associations were not sup- 
ported by the data that had been presented to them during their experimental 
“learning experience." For example, they listed atypical eyes as being associated 
with suspiciousness; a large head with worry about intelligence; and broad shoul- 
ders with concern about manliness. Not only were the interpretations unrelated 
to the empirical associations which the participants had “studied,” but in other 
experiments these stereotyped cultural associations also proved to be resistant to 
change under intensive training conditions designed to establish counter associa- 
tions. In other words, persons retained their a priori expectations even when ex- 
posed to contradictory observations. 

Illusory validation is a special example of the mechanism that underlies the 
survival of superstitions. We tend to notice and recall whatever fits our expecta- 
tions; and we tend to ignore and forget whatever is contrary to our expectations. 
This mechanism may actually interfere with the discovery and use of valid diag- 
nostic signs in the course of clinical observation by clinicians who are strongly 
wedded to a particular diagnostic system. The original research of Chapman and 
Chapman with the D-A-P has been corroborated by similar studies with the 
Rorschach and with the Incomplete Sentences Blank (Chapman & Chapman, 
1969; Golding & Rorer, 1972; Starr & Katkin, 1969). 

It should be noted, on the other hand, that certain inadequacies of experimen- 
tal design may have the opposite effect of underestimating the validity of a diagnos- 
tic instrument. It is widely recognized, for example, that traditional psychiatric 
categories, such as schizophrenia or depersonalization disorder, represent crude 
classifications of the disturbances actually manifested by patients. Hence, if such 
diagnostic categories are used as the sole criterion for checking the validity of a 
personality test, negative results are inconclusive. Similarly, failure to predict oc- 
cupational criteria may reflect no more than the examiner's ignorance of the traits 
required for the jobs under consideration. When such criteria are employed, it is 
possible that the projective test is a valid measure of the personality traits it is de- 
signed to measure, but that these traits are irrelevant to success in the chosen cri- 
terion situations. 

An increasing number of test users stress the importance of holistic and inte- 
grative principles in personality assessment, such as the use of configural patterns 
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and contextual variables. Many of them have been critical of continuing attempts 
to validate single indicators, isolated scores, or diagnostic "signs" derived from 
projective techniques. That insignificant correlations may result from failure to 
allow for complex patterns of relationships among personality variables can be il- 
lustrated by a plethora of contradictory findings about many of the projective 
techniques that clinicians find most helpful. For example, the hypothesized rela- 
tion between aggression in fantasy, as revealed in the TAT, and aggression in 
overt behavior is not a simple one. Depending on other concomitant personality 
characteristics, such as level of anxiety or fear of punishment, high aggression in 
fantasy may be associated with either high or low overt aggression (Harrison, 
1965; Mussen & Naylor, 1954). 

Lack of significant correlation between expression of aggression in TAT stories 
and in overt behavior in a random sample of individuals is thus consistent with. 
expectation, since the relation may be positive in some persons and negative in 
others. Obviously, however, such a lack of correlation is also consistent with the 
hypothesis that the test has no validity at all in detecting aggressive tendencies. 
What is needed, of course, is more studies using complex experimental designs 
that permit an analysis of the conditions under which each assumption is 
applicable. 


The Projective Hypothesis. The traditional assumption regarding projec- 
tive techniques has been that the individual's responses to the ambiguous stimuli 
presented to her or him reflect significant and relatively enduring personality 
attributes. Although it is clear that projective test responses can and do reflect the 
response styles and abiding traits of individuals, a large and growing body of 
research. indicates that many other factors can affect those responses. To the 
extent that retest, reliability has been measured, marked temporal shifts have 
frequently been observed, indicating the operation of considerable chance error. 
More direct evidence regarding the susceptibility of projective test responses to 
temporary states is provided by several experimental studies demonstrating the 
effect on such responses of factors such. as hunger, sleep deprivation, drugs; 
anxiety, and frustration. Significant response differences have likewise been found 
in relation to instructional sets, examiner characteristics, and the respondent's 
perception of the testing situation. Ability factors—and particularly verbal 
ability—clearly affect scores on most projective tests. In the light of all these 
findings, projective test responses can be meaningfully interpreted only when the 
examiner has extensive information about the circumstances under which they 
were obtained and the aptitudes and experiential background of the respondent. 

From another angle, the advantages of using unstructured or ambiguous stimuli 
have been questioned (Epstein, 1966). Such stimuli are ambiguous for the exam- 
iner as well as for the examinee; thus they tend to increase ambiguity in the inter- 
pretation of the examinee's responses. With structured stimuli, on the other hand, 
it is possible to select stimuli relevant to the personality characteristics to be as- 
sessed and to vary the nature of the stimuli to explore fully a given personality di- 
mension. Such a procedure makes for clearer interpretation of test performance 
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than is possible with the shotgun approach of unstructured stimuli. There is also 
evidence against the common assumption that the less structured the stimuli, the 
more likely they are to elicit projection and to tap “deep” layers of personality 
(Klopfer & Taulbee, 1976; Murstein, 1963). Actually, the relation between ambi- 
guity and projection appears to be nonlinear, with an intermediate degree of am- 
biguity representing an optimum for purposes of projection. 

The assumption that fantasy, as elicited by such projective techniques as the 
TAT, reveals covert motivational dispositions has also been called into question. 
For example, in a 20-year longitudinal study of TAT fantasy and relevant overt 
behavior, adolescent activities predicted adult TAT imagery much better than 
adolescent TAT imagery predicted adult activities (McClelland, 1966; Skolnick, 
1966). In one instance, individuals who had shown upward social mobility ob- 
tained higher scores on achievement needs as adults; but those who obtained 
higher achievement need scores in adolescence were not among the ones who 
subsequently showed upward social mobility. 

Findings such as these reverse the relationship implied by the traditional ratio- 
nale of projective techniques. They can be explained if we regard TAT responses 
not as direct projective expressions of motives, but as samples of the individual's 
thoughts, which may in turn have been influenced by her or his previous actions. 
Individuals who have achieved more and those who were more often exposed to 
achievement-oriented models in the developmental history tend to perceive 
more achievement themes in ambiguous pictures. 

In summary, many types of research have tended to cast doubt about various as- 
pects of the projective hypothesis. There is ample evidence that a variety of alter- 
native explanations may account for the individual's responses to unstructured or 
ambiguous test stimuli. 


Projective Techniques as Psychometric Instruments. Many. projective 
techniques are clearly found wanting when evaluated in accordance with test 
standards. This is evident from the data summarized in the preceding sections 
with regard to standardization of administration and scoring procedures, adequacy 
of norms, reliability, and validity. The accumulation of published studies that 
have failed to demonstrate any validity for such techniques as the TAT and 
the D-A-P is truly impressive, even after discounting the methodological 
inadequacies of many of the studies. Yet after several decades of contradictory 
results, the use of projective techniques remains substantially unchanged and 
may, in fact, be undergoing a growth spurt. In the words of one reviewer, "There 
a E enthusiastic clinicians and doubting statisticians” (Adcock, 1965, 
p. k 

This apparent contradiction can perhaps be understood if we recognize that, 
with a few exceptions, projective techniques are not truly tests. Among the no- 
table exceptions are the Holtzman Inkblot Technique, the work of Exner with the 
Rorschach, some adaptations of the TAT, some sentence completion tests, and 
the Rosenzweig P-F Study. A few other examples of quasi-tests could undoubtedly 
be found among the many remaining projective techniques that were not dis- 
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cussed in this chapter. Even in the case of these instruments, however, there is 
need for much more validity data to specify the nature of the constructs measur 
by their scores, as well as for more normative data on clearly defined populations. 
Thus, while coming closer to meeting test standards than have other projective 
techniques, most of these instruments are not ready for routine operational use in 
helping to make decisions and predictions about people. 


Projective Techniques as Clinical Tools. Rather than being regarded and 
evaluated as psychometric instruments, OF tests in the strict sense of the term, 
most projective instruments have come to be regarded more as clinical tools. 
Thus, they may serve as supplementary qualitative interviewing aids in the hands 
of a skilled clinician. Their value as clinical tools is proportional to the skill of the 
clinician and hence cannot be assessed independently of the individual clinician 
using them. Attempts to evaluate them in terms of the usual psychometric 
procedures may thus be inappropriate. But by the same token, the use of elaborate 
scoring systems that yield quantitative scores is not only wasteful but also 
misleading. Such scoring procedures lend the results an illusory semblance of 
objectivity and may create the unwarranted impression that the given technique 
can be treated as a test. The special value that projective techniques may have is 
more likely to emerge when they are interpreted. by qualitative, clinical 
procedures than when they are quantitatively scored and interpreted as if they 
were objective psychometric instruments. 

Borrowing a concept from information theory, Cronbach and Gleser (1965) 
characterized interviewing and projective techniques as “wideband” procedures. 
Bandwidth, or breadth of coverage, is achieved at the cost of lowered fidelity or 
dependability of information. Objective psychometric tests characteristically 
yield a narrow band of information at a high level of dependability. In contrast 
projective and interviewing techniques provide a much wider range of informa- 
tion of lower dependability. Moreover, the kinds of data furnished by any one pro- 
jective technique may vary from individual to individual. One person's TA 
responses, for example, may tell us a good deal about her or his aggression and lit- 
tle or nothing about creativity or achievement need; another person’s record may 
permit a thorough assessment of the degree of creativity and of the strength of 
achievement need, while revealing little about her or his aggression. Such a lac! 
of uniformity in the kinds of information provided in individual cases helps to ex- 
plain the low validities found when projective test responses are analyzed for any 
single trait across a group of persons. í v dou de 

It is interesting to note that a similar unevenness characterizes clinicians In- 
terpretations of individual records. Thus, in an early study of the validity of the 


TAT, Henry and Farley (1959, p. 22) concluded: 


There is no single correct way of employing the TAT interpretation. There was lit- 
tle item agreement between judges, but each judge made enough “correct” decisions 
to yield a highly significant agreement figure. Judges may arrive at essentially the 
same interpretive implications of the test report, by quite different routes; or judges 
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may differ individually in their ability to utilize TAT predictions in different areas 

... or for different subjects. : 

The nature of clinical judgment through which projective and interviewing 
data may be used in reaching decisions about individual clients is receiving in- 
creasing attention from psychologists (see chap. 17). In this process, the very con- 
structs or categories in terms of which the data are organized are built up 
inductively through an examination of the particular combination of data avail- 
able in each case. The special function of the clinician is to make predictions 
from unique or rare combinations of events about which it is impracticable to pre- 
pare any statistical table or equation. By creating new constructs to fit the indi- 
vidual, the clinician can predict from combinations of events he or she has never 
encountered before. In making these predictions, the clinician can also take into 
account the varied significance of similar events for different individuals. Such 
clinical predictions are helpful, provided they are not accepted as final but are 
constantly tested against information elicited through subsequent inquiry, test re- 
sponses, reaction to therapy, or other behavior on the part of the client. It follows 
from the nature of interviewing and projective techniques that decisions should 
not be based on any single datum or score obtained from such sources. These 
techniques serve best in sequential decisions, by suggesting leads for further ex- 
ploration or hypotheses about the individual for subsequent verification. 


(ther Assessment Techniques 


; he self-report inventories and projective techniques surveyed in the pre- 
^14 ceding chapters represent the best-known and most widely used instru- 
_ ments for: personality appraisal. Nevertheless, there still remains a rich 

— supply of other devices that are being explored for this purpose. Out of this 
diversity of approaches may come techniques that will eventually stimulate 
progress in new directions. The procedures considered in this chapter are princi- 
pally research techniques, although some may also serve as supplementary assess- 
ment tools in applied contexts, such as counseling or organizational psychology 
work. A wide variety of approaches is represented by the specific techniques cited. 
Several of them are hard to classify because they assess constructs that span the 
realms of abilities and personality. Three major categories include measures of 
cognitive styles and personality types; situational tests; and techniques designed 
to assess self-concepts and personal constructs. To add further perspective to this 
survey, some attention is given to the use of nontest techniques in personality as- 
sessment,» including naturalistic “observations, interviewing, ratings, and’ the 
analysis of life-history data. 


MEASURES OF STYLES AND TYPES 


Even though, in this book and elsewhere, ability and personality testing are dis- 
cussed separately, in actual test use and particularly in. test score interpretation 
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the cognitive and affective realms cannot be kept apart.’ The behavior samples 
that make up psychological tests are cross-sections of a person's behavioral reper- 
toire; as such, they contain information about all aspects of the person at once. In 
chapter 11, for example, we reviewed evidence of the growing consensus that ap- 
titudes cannot be investigated independently of affective variables because per- 
formance on ability tests is clearly dependent on achievement drive, persistence, 
values, and the like. Similarly, in chapter 13, we discussed the idea that individu- 
als differ in the consistency of their behavior across situations as a function of 
how they perceive and categorize situations, which, in turn, depends on their 
prior experience and learning. Moreover, throughout the chapters in Parts Three 
and Four, we have considered measures purportedly aimed at assessing either abil- 
ity or personality that also include aspects pertinent to affective or cognitive 
functioning, respectively. 

There are a number of ways of dealing with the multiplicity of factors inherent 
in human behavior. We can, for instance, simply examine the correlation be- 
tween measures of traits like anxiety and problem-solving ability (see, e.g., Zeid- 
ner, 1995). We can also use multivariate techniques, such as factor analysis and 
multidimensional scaling, to separate the components within a set of behavioral 
data (Jones & Sabers, 1992). Structural schemes that accommodate several vari- 
ables and represent their interrelationships topographically can be devised as 
well. The hierarchical model of the organization of abilities developed by Vernon 
(see chap.11), Holland's hexagonal representation of occupational themes (see 
chap. 14), and the interpersonal circumplex model (see, e.g., Hofstee et al., 1992) 
are examples of this type of arrangement within single realms of behavior.” Yet 
another approach is to use categories that are in themselves complex and that en- 
compass both cognitive and affective elements. Emotional intelligence (Salovey 
& Mayer, 1990; Mayer & Salovey, 1993) is one such construct that has recently 
garnered a good deal of attention (see, e.g., Goleman, 1995). Cognitive styles and 
personality types, discussed in the next two sections, also exemplify this kind of 
approach. They represent attempts to capture qualitative differences in the pat- 
terns or configurations of human behavior. 


Cognitive Styles. Cognitive styles refer essentially to one’s preferred and typi- 
cal modes of perceiving, remembering, thinking, and problem solving (Messick et 
al., 1976). They are regarded as broad stylistic behavioral characteristics that cut 
across abilities and personality and are manifested in many activities and media. 
An extensive research literature has accumulated on various cognitive styles and 
related concepts, such as learning and thinking styles (Brodzinsky, 1982; 
Furnham, 1995; Globerson & Zelniker, 1989; Goldstein & Blackman, 1978a, 


IThe need to integrate these domains in general psychological theory and research, as well as in practice, is 
being increasingly recognized. See, for example, the two recent volumes devoted to the interface between per- 
sonality and intelligence edited by Saklofske and Zeidner (1995) and Sternberg and Ruzgis (1994). 

2The spherical representation of vocational interests proposed by Tracey and Rounds (1996), mentioned in 
chapter 14, is an example of a more complex scheme involving multiple dimensions. 
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1978b; Grigorenko & Sternberg, 1995; Jonassen & Grabowski, 1993; Kogan, 
1976; Messer, 1976; Sternberg, 1994; Witkin & Goodenough, 1981). 

One of the principal sources of differentiation of cognitive styles can be found 
in the area of perceptual functions. A large body of experimental literature has 
demonstrated significant relationships between the individual's attitudinal, moti- 
vational, or emotional characteristics and her or his performance on perceptual or 
cognitive tasks. It should also be recognized that a number of projective tech- 
niques—notably the Rorschach—may be regarded as essentially perceptual tests 
(see, e.g., Blatt, 1990). 

Of the factors identified in early factorial analyses of perception, one that has 
proved particularly fruitful in personality research is flexibility of closure (Pem- 
berton, 1952; Thurstone, 1944). A common type of test for this factor requires 
the identification of a figure amid distracting and confusing details. Two items 
from a test with a high loading in this factor (Gottschaldt Figures) are shown in 
Figure 16-1, Several early studies reported suggestive data indicating possible re- 
lationships between this perceptual factor and personality traits. In one investiga- 
tion, for example, persons who. excelled. in. flexibility of closure had high 
self-ratings on trait descriptions such as socially retiring, independent of the opin- 
ion of others, analytical, interested in theoretical and scientific problems, and dis- 
liking. systematization and routine (Pemberton, 1952). Over the intervening 
years, investigators have employed adaptations of the Gottschaldt Figures in re- 
search.on both cognitive and noncognitive behavior. 

Approaching the problem from a different angle, Witkin. and his associates 
(Witkin et al., 1954/1972) identified the ability to resist the disruptive influence 


Flexibility. of Closure (Gottschaldt Figures): Which of the four 
drawings at the right contain the design at the left? 


Figure 16-1. Items Illustrating Perceptual Tasks Employed in 
Personality Assessment. 
(From Thurstone, 1950, p. 7) 
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of conflicting contextual cues as an important variable in the long-term study of 
perceptual space orientation. Through various tests utilizing a rod and frame that 
could be independently moved, a tilting chair, and a tilting room, these investiga- 
tors were able to show that individuals differ widely in their “field dependence,” 
or the extent to which their perception of the upright is influenced by the sur- 
rounding visual field. Odd-even and retest reliabilities were high, and most of the 
intercorrelations among the different spatial orientation tests were significant. 
Thus, a substantial body of data was amassed to indicate that field dependence is a 
relatively stable, consistent trait, having a certain amount of generality. 

Of even more interest are the significant correlations between these orienta- 
tion tests and the Embedded Figures Test (similar to the Gottschaldt Figures illus- 
trated in Fig. 16-1), which has been considered as a measure of field dependence in 
a purely visual, paper-and-pencil situation. As more research accumulated, field 
dependence-independence came to be regarded as the perceptual component ofa 
broader personality dimension, designated as global versus articulated cognitive 
style, or psychological differentiation (Witkin, Dyk, Faterson, Goodenough, & 
Karp, 1962/1974). There is evidence that this cognitive style exhibits consider- 
able stability through childhood and early adulthood and is related to a number of 
personality variables, such as leadership (Weissenberg & Gruenfeld, 1966), social 
conformity (Witkin et al, 1974), and many others (see, e.g., Jonassen & 
Grabowski, 1993, chap. 7). 

The scope and diversity of research on field dependence are truly impressive, 
ranging from interpersonal relations (Witkin & Goodenough, 1977) to learning 
and memory (D. R. Goodenough, 1976), mathematics achievement (Vaidya & 
Chansky, 1980), choice of field of study in college and graduate school (Raskin, 
1985), cross-cultural differences (J. W. Berry, 1976), and work environment pref- 
erences (Wooten, Barner, & Silver, 1994). An example of the intriguing relation- 
ships that emerge from surveys of many studies is the finding that 
field-independent persons tend to follow active, “participant” approaches to learn- 
ing, whereas field-dependent persons more often use “spectator” approaches. One 
recent investigation comparing the outcomes of multiple-choice tests and perfor- 
mance-based assessments, for instance, suggests that the latter favor field-indepen- 
dent students (Lu & Suen, 1995). In interpersonal situations, on the other hand, 
the field-dependent tend to have certain advantages in getting along with others. 
They tend to be more attentive to social cues, more responsive to other persons’ 
behavior, and more emotionally open than are the field-independent persons. It 
appears that neither end of the field dependent-independent continuum is neces- 
sarily or uniformly favorable or unfavorable; rather, the value of deviations in ei- 
ther direction depends on the demands of particular situations. 

Much of this research has used the Embedded Figures Test, which is relatively 
easy to administer. Published forms of this test are available for the adult, child, 
and preschool levels, as well as for group administration (Coates, 1972; Witkin, 
Oltman, Raskin, & Karp, 1971). Two demonstration items from the Group Em- 
bedded Figures Test are shown in Figure 16-2. In both the children's and the 
preschool forms, the complex figures are recognizable, familiar objects, and the 
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Try to find the simple form in the complex figure and trace it in pencil 
directly over the lines of the complex figure. It is the SAME SIZE, in 
the SAME PROPORTIONS, and FACES IN THE SAME DIRECTION 
within the complex figure as when it appeared alone. 


Figure 16-2. Demonstration Items from the Group Embedded 
Figures Test. 
(Copyright © 1972 by Consulting Psychologists Press. Reproduced by permission.) 


tests are administered individually. The original adult form is also individually ad- 


e.g., Bertini, Pizzamiglio, & Wapnet, 1985), much remains to be done to clear up 
inconsistencies found in studies of field independence and its relationship to a wide 
diversity of behavioral variables. One problem pertains to nomenclature and, in 
turn, reflects differences in how field independence is regarded. Some investigators 
have concluded, based on its correlation with measures of g, that the field indepen- . 
dence construct is a cognitive ability after all; others have classified it as a cognitive 
“control” and have placed it at an intermediate position, between cognitive abili- 
ties and cognitive styles (McKenna, 1984; Jonasson & Grabowski, 1993). A related 
obstacle to generalization is presented by lack of uniformity among studies with re- 
gard to participant characteristics and methodology. For instance, findings suggest 
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that whereas the results on paper-and-pencil measures, such as the Group Embed- 
ded Figures Test, are heavily influenced by general intelligence, scores on perfor- 
mance tests, like the Portable Rod-and-Frame, are more purely a reflection of field 
independence as a value-free stylistic variable (Arthur & Day, 1991). 

In spite of the conceptual and methodological obstacles their study may pre- 
sent, cognitive styles are of great theoretical and practical interest? Because they 
lie at the intersection between abilities and personality, these styles can influence 
and modulate behavior in both the affective and intellectual realms. Thus, ques- 
tions about their nature and characteristics, such as the extent to which they are 
fixed or flexible, acquire considerable importance (see, e.g., Niaz, 1987). Further- 
more, in endeavors where learning is a primary goal, there has been increasing 
recognition that the efficiency of both instruction and assessment may be consid- 
erably improved by taking stylistic factors into account ( Furnham, 1995; Jonassen 
& Grabowski, 1993; Lu & Suen, 1995; Sternberg, 1988, 1994b; Zelniker, 1989). 


Personality Types. Like cognitive styles, personality types also refer to con- 
structs that have been used to explain similarities and differences in preferred 
modes of thinking, perceiving, and behaving within and across individuals. Fun- 
damentally, personality types are categories defined by certain configurations of 
two or more traits or attributes. As explanations for human behavior, typologies 
have a long history that dates back to the ancient Greek theory of body humors— 
blood, black bile, yellow bile, and phlegm—and the temperament types associ- 
ated with them, namely, sanguine, melancholic, choleric, and phlegmatic. Typo- 
logical systems often have enormous popular appeal because they offer relatively 
simple, seemingly firm bases for understanding and possibly explaining one’s 
behavior and that of others. Such systems vary a good deal in the number and 
kind of dimensions that can be used to generate types. 

Within psychology, a number of different typologies have been devised over 
the years. Some, like Sheldon’s system for classifying personality types on the 
basis of physique (Sheldon & Stevens, 1942/1970), were based mostly on specula- 
tion and have been discarded for lack of corroborating evidence. Others, like the 
MMPI code types described in chapter 13, arose from empirical observations, 
continued to evolve, and are still in use (Graham, 1993). Most typological sys- 
tems encompass distinctions within a single area, such as occupations or tempera- 
ments. À few pertain to more than one domain. Among the latter, Alan Miller's 
efforts to synthesize existing typologies across the cognitive, affective, and moti- 
vational dimensions stand out. Miller's (19912, 1991b) conceptual linkages across 
systems are based on a review and analysis of a large number of typologies within 
the three areas and should prove to be of considerable heuristic value. 


DE E NN e i RN A 
31t should be noted that cognitive styles vary in complexity. Some styles pertain to relatively simple differences; 
for example, the reflective-versus-impulsive style, or cognitive tempo, is defined only by the speed with which 
individuals respond to problem situations, especially those involving a high degree of uncertainty (Kagan, 1965; 
S. B. Messer, 1976). Others, such as Sternberg's thinking styles (1988, 1994b), encompass more complex pat- 
terns of dispositions. 
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Myers-Briggs Type Indicator. One of the most enduring typological classifi- 
cations was devised by Jung (1921/1971) and has served as the foundation for the 
Myers-Briggs Type Indicator (MBTI—Myers, 1962; Myers & McCaulley, 1985), 
which is a widely used instrument for the assessment of personality in norm: 
individuals.* The MBTI uses Jungs well-known dichotomy of extraverted an 
introverted attitudes (E and I), and his classifications of opposing ways © 
perceiving (sensation versus intuition—S vs. N) and of contrasting approaches to 
judgment (thinking versus feeling—T vs. F). It also includes a polarity of 
preferences in orientation toward the outer world (judgment or perception—J or 
P) that was not explicit in Jung’s work. The scores on these four—presumably 
independent—dimensions result in sixteen possible “type formulas” that give 
combinations of the letters of the preferred direction within each of the four 
dimensions. For example, the combination “INTP” represents the “introverted, 
intuitive, thinking, and perceptive” type? Each preferred direction also has a 
numerical score that indicates the strength of the preference and is calculated 
from the difference in points between the poles of each dimension. 

The results of the MBTI, unlike those of most other personality inventories, 
are intended primarily for use by the respondent and are presented in a nonjudg- 
mental fashion. Two of the most fundamental premises used in interpreting the 
results of the MBTI are (a) that all types are valuable and necessary and have spe- 
cial strengths and vulnerabilities and (b) that individuals are more skilled within 
their preferred functions, processes, and attitudes. These characteristics have con- 
tributed to the popularity of the MBTI and to its application for a variety of pur- 
poses, including career guidance, counseling, and team selection and develop- 
ment. The psychometric difficulties posed by the MBTI scores have been well 
documented (for reviews, see DeVito, 1985; J. S. Wiggins, 1989). In particular, re- 
viewers have objected to the forced dichotomies of the Myers-Briggs, whereby 
identical profile type letters are assigned on the basis of difference scores that may 
vary considerably in magnitude. Nevertheless, a substantial amount of research 
with the instrument continues to be carried out. In addition to investigations ad- 
dressing the validity of the MBTI, a number of studies have also explored alterna- 
tive ways of scoring it (see, e.g., Davis, Grove, & Knowles, 1990; Girelli & Stake, 
1993; Harvey & Murry, 1994). Moreover, variations of the Myers-Briggs and 
adaptations of its constructs for various uses have also been devised. One note- 
worthy variation is the recently developed Student Styles Questionnaire (Oak- 
land, Glutting, & Horton, 1996), which is aimed at evaluating styles of learning, 
working, and relating in students between the ages of 8 and 17 years. 

Notwithstanding the popularity of the Myers-Briggs and of a few other typolo- 
gies, historically, most psychologists who study personality have conceived of it as 


4The MBTI was mentioned briefly in chapter 13 in connection with the use of the forced-choice technique in 


self-report personality inventories. 3 
5According to the MBTI manual, such a person would, among other things; (a) be quiet and reserved, (b) enjoy 
solving problems with logic and analysis, (c) be interested mainly in ideas, and (d) tend to have sharply defin 
interests (Myers & McCaulley, 1985). 
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a psychic entity made up of components, or traits, on which people differ quanti- 
tatively, rather than as a taxonomy based on qualitative differences. Furthermore, 
within scientific psychology there has been a general reluctance to use types as 
explanatory concepts. This reluctance has stemmed primarily from three sources: 
(a) the emphasis that many personality theorists have placed on the need to un- 
derstand and value the uniqueness of individuals, (b) the implicit link between 
types and stereotypes and the inherent dangers associated with the latter, and (c) 
the paucity of adequate quantitative methods for identifying and analyzing cate- 
gorical data. 

In aseries of recent developments, however, some of the traditional objections 
to typological constructs appear to have been challenged. For instance, Paul 
Meehl believes that it is time to explore empirically the possible existence of nat- 
urally occurring, nonarbitrary classes—or taxa—within the nonpathological 
range of individual differences in personality. He and his associates have proposed 
new taxometric methods that can be used to identify such classes (Meehl, 1992, 
1995; Mech) & Golden, 1982; Mech) & Yonce, 1994).° Similarly, after reviewing 
a number of geometric trait models of personality, ranging from unidimensional 
bipolar traits to multidimensional circumplexes, Dahlstrom (1995) concludes 
that such models are inadequate to the task of organizing personality constructs 
and research. According to him, the inherently configural nature of personality 
structure and functioning requires “a schema of pigeon-holes to summarize in a 
faithful and accurate way how we differ and how we resemble each other” (p. 14). 
Finally, even the deeply entrenched notion that stereotypes are usually inaccurate 
and destructive has recently been questioned (Lee, Jussim, & McCauley, 1995). 

These and other developments portend a renewed interest in, and a more so- 
phisticated approach to, the use of multidimensional categories, such as types and 
styles, as part of the conceptual arsenal available to study and explain individual 
differences in behavior. Nevertheless, regardless of how useful or popular such 
constructs may be, they will undoubtedly coexist with—not replace—approaches 
that emphasize the unique aspects of individuals and of their behavior. In the as- 
sessment of people, the dangers of reifiying types and using them simplistically or 
overinclusively as explanatory concepts always need to be kept in mind. 


SITUATIONAL TESTS 


Although the term “situational test” was popularized during and following World 
War II, tests fitting this description had been developed prior to that time. Essen- 
tially, a situational test is one that places the test taker in a situation closely re- 
sembling or simulating a "real life" criterion situation. Such tests thus show 
certain basic similarities to the job-sample techniques employed in constructing 


Although there is little evidence concerning their transportability from one situation to another, the subgroups 
generated through cluster analyses of biodata—discussed later in this chapter—also offer interesting possibilities 
from the point of view of typological classification. 
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occupational achievement tests and to the performance-based assessments of edu- 
cational achievement, both of which are described in chapter 17. In the present 
tests, however, the criterion behavior that is sampled is usually more varied and 
complex. Moreover, performance is evaluated primarily in terms of emotional, in- 
terpersonal, attitudinal, and other personality variables, rather than in terms of 
abilities and knowledge: 


Tests of the Character Education Inquiry. Among the earliest situational 
tests—although they were not so labeled at the time—were those constructed by 
Hartshorne, May, and their associates (1928, 1929, 1930) for the Character 
Education Inquiry (CEI). These tests were designed principally as research 
instruments for use in an extensive project on the nature and development of 
character in childrens Nevertheless, the techniques can be adapted to other 
testing purposes, and a number have been so utilized. 

In general, the CEI techniques made use of familiar, natural situations within 
the schoolchild’s daily routine. The tests were administered in the form of regu- 
lar classroom examinations; as part of the pupil's homework, in the course of ath- 
letic contests, or as party games. Moreover, the children were not aware that 
they were being tested, except insofar as ordinary school examinations might be 
involved in the procedure. At the same time, all of the Hartshorne-May tests 
represented carefully standardized instruments which yielded objective, quanti- 
tative scores. 

The CEI tests were designed, in novel and ingenious ways, to measure such be- 
havior characteristics as honesty, self-control, and altruism. The largest number 
were concerned with honesty and involved situations in which the children were 
made to believe that they could cheat without being detected. In the Circles Puz- 
zle, for example, the child was instructed to make a mark in each of 10 small, ir- 
regularly arranged circles, while keeping her or his eyes shut. Control tests under 
conditions that precluded peeking indicated that a score of more than 13 cor- 
rectly placed marks in a total of three trials was highly improbable. A score above 
13 was therefore recorded as evidence-of peeking. que 

Most of the CEL tests proved to have good discriminative power, yielding a 
wide range of individual differences in scores. Reliability also appeared to be fairly 
satisfactory. The children's responses; however, showed considerable situational 
specificity. The intercorrelations of different tests within. each category (e.g hon- 
esty tests or persistence tests) proved to be very low. This specificity is under- 
standable when we consider the operations of the child's interests, values, and 
motives in different situations. For example, the child who is motivated to excel 
in schoolwork is not necessarily concerned about achievement in athletic con- 
tests or party games. These motivational differences would in turn be reflected in 
the child’s behavior on honesty tests administered in these different contexts. 
Both in their findings and in their interpretations, the CEI investigators antici- 
pated the emphasis on situational specificity by four decades. However, subse- 
quent reanalyses of the CEI data have suggested that a more appropriate model 
includes both a general factor of honesty and a situational component and that 
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there is even some evidence for a general factor of moral character (Burton, 1963; 
Rushton, 1984).” 


Situational Tests in Assessment Centers and Roleplaying Techniques. 
Situational tests constituted a major part of the assessment-center program 
introduced by the United States Office of Strategic Services (OSS) during World 
War II. The assessment-center technique involves essentially a “living-in” period 
of several days, during which candidates are observed and examined in a variety 
of ways. It represented the principal procedure in the selection of military 
personnel for critical overseas assignments (Murray & MacKinnon, 1946; OSS 
Assessment Staff, 1948). Similar procedures were subsequently incorporated in 
the Institute for Personality Assessment and Research at the University of 
California, as well as in a number of large-scale assessment projects for both 
military and civilian personnel? Still later, the assessment-center technique, 
together with some situational tests, was adopted by large industrial corporations 
for the evaluation of high-level executives. 

One type of test developed by the OSS was the situational stress test, designed 
to sample the individual’s behavior under stressful, frustrating, or emotionally dis- 
ruptive conditions. For example, the examinee was assigned a task to perform 
with two “helpers” who were obstructive and uncooperative. Another type of sit- 
uational test employed a leaderless group as a device for appraising such character- 
istics as teamwork, resourcefulness, initiative, and leadership. In such tests, the 
assigned task requires the cooperative efforts of a group of examinees, none of 
whom is designated as leader or given specific responsibilities. Examples from the 
OSS program include the Brook Situation, involving the transfer of personnel 
and equipment across a brook with maximum speed and safety; and the Wall Sit- 
uation, in which people and materials had to be conveyed over a double wall sep- 
arated by an imaginary canyon. 

A variant of this technique is the Leaderless Group Discussion (LGD). Requir- 
ing a minimum of equipment and time, the LGD has been used widely in the 
selection of such groups as military officers, civil service supervisors and admin- 
istrators, industrial executives and management trainees, sales trainees, teachers, 
and social workers. Essentially, the group is assigned a topic for discussion during a 
specified period. Examiners observe and rate each person’s performance but do not 
participate in the discussion. Although often used under informal and unstandard- 
ized conditions, the LGD has been subjected to considerable research. Studies 
with this technique suggest that, especially when raters are properly trained, it can 
be an effective tool for predicting performance in jobs requiring verbal communi- 
cation, verbal problem solving, and acceptance by peers (Bass, 1954; Greenwood 
& McNamara, 1967; Guilford, 1959; Thornton & Zorich, 1980). 


"For an interesting analysis of the role that the Hartshorne and May studies have played in the debate concern- 
ing the teaching of virtues and character development, see Vitz (1990). 

®An account of these milestone assessment studies, beginning with the OSS program, can be found in LS Wig- 
gins (1973/1988, chap. 11). 
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i Some situational tests employ roleplaying or improvisation to elicit the behav- 
ior of interest. In fact, the previously cited leaderless group discussions, as well as 
certain other situational tests, may be regarded as variants or elaborations of role- 
playing. Although roleplaying was one of the techniques used in the OSS assess- 
ment program, it has earlier origins and wider applications. A comprehensive 
survey of the history, rationale, and varieties of the roleplaying or improvisation 
technique can be found in McReynolds and DeVoge (1978). In this technique, 
the individual is explicitly instructed to play a part, either overtly (with or with- 
out other persons) or by reporting verbally what he or she would do or say. The 
situation may be presented realistically, as on a stage, or through audiotape, 
videotape, or printed description.? 

The improvisation technique continues to enjoy great popularity, although 
much of its use is informal and tailored to specific settings and local conditions. 
One major application is in the occupational assessment of personnel, especially 
when interpersonal behavior is important in job functions (Stricker, 1982; 
Stricker & Rock, 1990). A special example pertains to the evaluation of coun- 
selor effectiveness. In this situation, the prospective counselor is observed—or 
videotaped—while conducting a counseling session with a “coached client,” that 
is, a staff member or fellow student in the role of a client who presents a prese- 
lected and standardized problem (see, e.g., Connor, 1994, pp. 72-75; Kelz, 1966; 
Neufeldt, Iversen, & Juntunen, 1995; A. Williams, 1995). When videotaping is 
used, a trainee can monitor and evaluate her or his own performances, in addition 
to being rated by assessors and peers.!° 

Assessment-center techniques have repeatedly proved to be effective predic- 
tors of a variety of criteria (see, e.g. Coulton & Feild, 1995; Howard & Bray, 
1988; Ritchie, 1994; Tziner, Ronen, & Hacohen, 1993). They have been exten- 
sively used in fields in which selection criteria are complex, such as law enforce- 
ment work (J. L. Coleman, 1987; Moore & Unsinget, 1987). Some work has even 
been done on adapting these techniques for use with deaf job applicants (Berkay, 
1993). Because of the breadth of their applicability, we cannot generalize about 
the validity of the assessment-center technique; the results differ from one center 
to another, depending upon the specific procedures employed, the nature of the 
criterion, and the qualifications of the assessors. In general, validity coefficients 
tend to be highest when methodologies are soundest-—such as in studies that use 
several devices, including peer evaluations—and focus on relevant and directly 
observable behavioral dimensions (Gaugler, Rosenthal, Thornton, & Bentson, 
1987; Shore, Shore, & Thornton, 1992; Thornton & Byham, 1982). In spite of 


extensive research, a number of questions about assessment-center techniques 
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Multimedia and interactive computer technologies allow entirely new ways of presenting realistic stimulus situ- 
ations and response options to test takers. For a discussion of possibilities and problems in developing multime- 
dia tests and a description of an interactive video instrument measuring conflict resolution skills, see Drasgow, 
Olson-Buchanan, and Moberg (1996) and Olson-Buchanan, Drasgow, Moberg, Mead, and Keenan (1996). 

1p clinical psychology, roleplaying has undergone extensive and systematic development within several differ- 
ent theoretical orientations, particularly in behavior modification programs, family therapy, and marriage 
counseling. 


468 PERSONALITY TESTING 


remain unanswered. Probably the most vexing question of all pertains to the in- 
ability of several investigations to demonstrate convergent validity across differ- 
ent methods of assessing particular dimensions of performance. ! 


SELF-CONCEPTS AND PERSONAL CONSTRUCTS 


The 1980s and 1990s have witnessed a resurgence of interest in the self-concept 
and related constructs (Byrne, 1996; Harter, 1990; Hattie, 1992; Markus & Wurf, 
1987; Oosterwegel & Oppenheimer, 1993; Wylie, 1989). Several current ap- 
proaches to personality assessment concentrate on the way individuals view 
themselves and others. Such techniques often reflect the influence of phenome- 
nological psychology, which focuses on how events are perceived by the individ- 
ual. The individual's self-description thus becomes of primary importance in its 
own right, rather than being regarded as a second-best substitute for other behav- 
ioral observations. Interest also centers on the extent of self-acceptance shown by 
the individual. 

The procedures considered in this section all have as their primary focus indi- 
viduals’ perceptions of themselves and others. Although a few instruments are 
published tests, many were developed for use in specific research projects and are 
available in the published reports of these projects. Some are of interest chiefly 
because of their association with particular theories of personality or with an ac- 
tive area of ongoing research. Others are widely applicable techniques used in 
studying a variety of problems. 


Washington University Sentence Completion Test. It might be argued 
that self-concept tests do not differ essentially from the self-report inventories dis- 
cussed in chapter 13. True, but it would be more accurate to say that self-report 
inventories are actually measures of self-concept. The interpretation of inventory 
responses in terms of self-conceptualization forms the basis of the theoretical 
approach to personality development formulated by Loevinger (1966a, 1966b, 
1976, 1987, 1993; Loevinger & Ossorio, 1958). Bringing together many disparate 
findings from her own research and that of others, Loevinger proposed a person- 
ality trait that she defined as the capacity to conceptualize oneself, or to “assume 
distance" from oneself and one's impulses. According to Loevinger, it is the 
manifestations of this trait in personality inventories that have been described in 
such terms as facade, test-taking defensiveness, response set, social desirability, 
acquiescence, and personal style. In common with a number of other psycholo- 
gists, Loevinger regards such test-taking attitudes not as instrumental errors to be 
ruled out but as the major source of valid variance in personality inventories. 


'lFor brief but informative reviews of assessment-center research, see Landy, Shankster, and Kohler (1994, pp. 
277-278), and Schmidt, Ones, and Hunter (1992, pp. 635-637). 


Län overview of the philosophical and psychological approaches to the self and self-related processes, from the 
17th through the 20th centuries, can be found in Levin (1992). 
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= On the basis of data from many sources, Loevinger suggested that ability to 
form a self-concept increases with age, intelligence, education, and socioeco- 
nomic level. At the lowest point, illustrated by the infant, the individual is inca- 
pable of self-conceptualization. As the ability develops, the child gradually forms 
a stereotyped, conventional, and socially acceptable concept of herself or himself. 
This stage Loevinger considers typical of adolescence: With increasing maturity, 
the individual progresses beyond such a stereotyped concept to a differentiated 
and realistic self-concept. At this point, individuals are fully aware of their idio- 
syncrasies and accept themselves for what they are. 

It is this trait of self-conceptualization, designated as ego development or 
ego level, that Loevinger and her associates undertook to measure in the Wash- 
ington University Sentence Completion Test (WUSCT: — Loevinger, 1985, 1987; 
Loevinger & Wessler, 19770; Loevinger, Wessler, & Redmore, 1970). The authors’ 
theoretical framework postulates nine levels of ego development, as follows: Pre- 
social, Impulsive, Self-Protective, Conformist, Self-Aware, Conscientious, Indi- 
vidualistic, Autonomous, and Integrated. All but the first level, which precedes 
the emergence of verbal skills, can be assessed through the WUSCT. Each sen- 
tence completion is assigned an ego level rating, and a composite score on the en- 
tire test is computed from these values. The WUSCT was based on research 
conducted: with women and adolescent girls and was later adapted for use with 
men and boys. The test has since been revised and now provides comparable forms 
for men and women (Loevinger, 1985; Novy, 1992). A new scoring manual pre- 
sents data for both females and males (Hy & Loevinger, 1996). 

Continuing research with the WUSCT has provided evidence supporting its 
reliability (Novy & Francis, 1992; Weiss, Zillberg, & Genevro, 1989) and its va: 
lidity as à measure of the construct of ego development among various samples 
(Bushe & Gibbs, 1990; Novy, Gaa, Frankiewicz, Liberman, & Amerikaner, 1992; 
Westenberg & Block, 1993). A particularly fruitful application of the instrument 
has been in the study of sex differences in ego development (Cohn, 1991). One of 
the few problematic aspects of the WUSCT, shared by other sentence completion 
tests and techniques that allow for open-ended verbal responses, is that its scores 
tend to correlate with, and may in some cases be influenced by, verbal fluency and 
vocabulary level (see, e.g Vaillant & McCullough, 1987; Westenberg & Block, 
1993). Although this possibility needs to be taken into account in designing re- 
search with the WUSCT, as well as in other uses of the test, it is consonant with 
its theoretical underpinnings and need not invalidate results. 


Self-Esteem Inventories and Related»Measures. In an ever-expanding 
body of research, the construct of self-concept merges with ‘certain related 
constructs designated as self-esteem (Baumeister, 1993; Bednar & Peterson, 1995) 
and perceived self-efficacy (Bandura, 1982, 1995; Maddux, 1995; Schwarzer, 
1992). A major thrust of this research relates to the effect of the individual's self- 
evaluation on her or his performance. Self-esteem is, in fact, typically described as 
the evaluative component of the self-concept. On a long-term, cumulative basis, 
such self-evaluations may influence the development of cognitive and affective 
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traits. In particular, there is widespread agreement that self-esteem is a crucial 
determinant of such psychologically important variables as coping ability and 
sense of well-being. : 

The construct of self-esteem is, at first glance, deceptively simple. It is often as- 
sumed to be a general evaluative attitude toward oneself, ranging from extremely 
positive to extremely negative, that is stable and entirely subjective in nature. 
The measurement of self-esteem for research and applied purposes has tradition- 
ally proceeded on the basis of these assumptions. So much so, in fact, that Blas- 
covich and Tomaka (1991) in their comprehensive review of measures of 
self-esteem and self-concept identified Rosenberg's (1965) Self-Esteem Scale 
(SES)—a ten-item, face valid, self-report scale—as the most frequently used 
measure of its kind. 

Simple and obvious measures of global self-esteem, like the SES, do appear to 
be relatively stable over time and may be useful as a gauge of general self-regard 
for some purposes. Nevertheless, many investigators have found that the relation 
between self-concept and behavior can be more clearly demonstrated if the for- 
mer is viewed as a hierarchical, multidimensional construct and is assessed ac- 
cordingly (Fleming & Courtney, 1984; Marsh, Byrne, & Shavelson, 1992; Marsh 
& Shavelson, 1985; Shavelson & Bolus, 1982; Shavelson, Hubner, & Stanton, 
1976; Uguroglu & Walberg, 1979). Under certain circumstances, the use of a sin- 
gle, global self-esteem measure may yield inconsistent results or fail to reveal sig- 
nificant correlations with other variables, whereas a more narrowly defined 
construct, such as academic self-concept, will yield consistent and significant re- 
sults. This is especially true with those persons for whom academic achievement 
may not rank high in the personal value system. In such cases, a high global self- 
concept, slanted to reflect the respondent’s own value system, may not correlate 
significantly with academic achievement or intellectual functioning. Further- 
more, research indicates that measures of “experienced self-esteem,” based on 
self-reports, and measures of “presented self-esteem,” based on the reports of oth- 
ers; do not always correlate (Demo, 1985). 

As a result of these findings, research in recent years, has shifted away from 
unidimensional conceptualizations of self-esteem and toward studies of its specific 
facets, and it has proliferated rapidly. Especially numerous are investigations of 
the relationship between academic self-concept and academic achievement in 
children and adolescents. In this particular area, studies using sophisticated 
methodologies—such as structural equation modeling and longitudinal designs 
that allow for the analysis of the direction of causal relations—have supported 
the notion that academic self-concepts are linked to specific fields of study. More 
importantly, they have shown that these concepts correlate with, predict, and in- 
fluence subsequent acadernic achievement (Fortier, Vallerand, & Guay, 1995; 
House, 1995; Lyon & MacDonald, 1990; H. W. Marsh, 1990a, 1990b). 

The newest measures of self-concept make use of the extensive theoretical and 
empirical literature that has accumulated in this area. The Student Self-Concept 
Scale (SSCS—Gresham, Elliott, & Evans-Fernandez, 1993), for example, is a 
commercially available measure that uses Bandura’s (1982, 1986) theory of self- 
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efficacy as a starting; point and draws from other theories and research findings as 
well. It assesses three major domains of self-concept, namely, academic, social, 
and self-image. Within each of these domains, respondents indicate not only how 
confident they are that they are able to do what the items state but also how im- 
portant the items are to them and how confident they are that having certain at- 
tributes or doing certain things will lead to certain outcomes. The SSCS produces 
both subscale and composite scores that are norm-referenced, separately, for fe- 
male and male students at the elementary and high school levels. 

A similar, albeit not commercially available, measure—the Personal and Acade- 
mic Self-Concept Inventory (PASCI—Fleming & Whalen, 1990)—was devel- 
oped using: high. school and college students. This measure attempts to 


assess feelings of inadequacy: 3 The current version consists of a global self-esteem 
scale and six additional facet scales. Of these, two deal with social aspects of the 
self-concept (Social Acceptance and Social Anxiety), two pertain to its physical 
aspects (Physical Appearance and Physical Ability), and the two remaining are 
academic (Math Ability and Verbal Ability). The SSCS and the PASCI are, of 
course, subject to the-response biases inherent in all self-report measures and dis- 
cussed in chapter 13. Moreover, they both still need to document their effective- 
ness further, within the contexts for which they were developed. However, from 
the point of view of construct differentiation and content specification, they are 
representative of the advances that have been made in the conceptualization and 
measurement of self-esteem. 


The Adjective Check List. Several broadly oriented techniques have been 
developed specifically for assessing self-concepts. A widely applicable instrument 
that is commercially available is the Adjective Check List (ACL). Originally 
constructed for use in the research program of the Institute for: Personality 
Assessment and Research (IPAR), this instrument provides a list of 300 adjec- 
tives arranged alphabetically from “absentminded” to “zany” (Gough, 1960; 
Gough & Heilbrun, 1983). Respondents mark all the adjectives they consider 


descriptive of themselves: t 
In its current form, the ACL can be scored for 37 scales; four of which assess re- 


sponse sets. A major cluster of scales was originally prepared on a rational or con- 
tent basis, by assigning adjectives to each of the 15 Murray needs covered by the 
EPPS (see chap. 13). An additional set of nine "topical scales" was developed 
principally through empirical criterion keying of items to measure various traits 


considered important in interpersonal behavior. The two remaining clusters of 
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BI. zi maka (1991 wide a summary of the evolution of this instrument, as well as reviews of 
aam ` SC NEE model has been one of the most influential in the field and was 
also used in the concep! tion of the SSCS. For information on the model, see Marsh et al. (1992), Marsh 
and Shavelson (1985), Shavelson and Bolus (1982), and Shavelson et al. (1976). 
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scales were designed to fit specialized personality theories; namely, Berne's (1961, 
1966) Transactional Analysis theory and Welsh’s (1975b) theory of creativity and 
intelligence. For all 37 scales, a principal source of empirical validation data, em- 
ployed at some stage in the development of each scale, was the direct observation 
of participants in assessment-center programs at IPAR and the resulting trait rat- 
ings. On the basis of these IPAR assessments as well as other supplementary te- 
search, the manual provides personality descriptions of persons scoring high and 
low on each scale. 

Asa research instrument, the ACL has been applied to a staggering variety of 
problems, drawn from such areas as psychopathology, occupational choice, cre- 
ativity, political and economic behavior, and even patients’ reactions to ortho- 
dontia and to contact lenses. It has also been employed in rating historical 
personages from their biographies and published works (Welsh, 1975a) and in 
characterizing inanimate objects, such as cities and automobiles. More recently, 
the ACL has been used in studies of the developmental changes in women at 
midlife (Helson & Wink, 1992; Wink & Helson, 1993; York & John, 1992), and 
in investigations of narcissism (Wink, 1991, 1992), among others (for reviews, 
see Teeter, 1985; Zarske, 1985).!4 


Q Sort. Another special technique suitable for investigating self-concepts is 
the Q sort, originally developed by Stephenson (1953) to implement an approach 
to research known as Q methodology (see, e.g., Kerlinger, 1986, chap. 32; 
McKeown & Thomas, 1988). In the Q sort technique, the respondent is given a 
set of cards containing statements or trait names to be sorted into piles ranging 
from “most characteristic” to “least characteristic” of herself or himself. The items 
may come from a standard list, but more often they are designed to fit the 
individual case. To ensure uniform distribution of ratings, a “forced normal” 
distribution is used, the respondent being instructed to place a specified number 
of cards in each pile. Such a distribution can be prepared for any size of item 
sample by reference to a normal curve table. It should be noted that, like the 
forced-choice technique discussed in chapter 13, the Q sort yields ipsative rather 
than normative data. In other words, respondents tell us which they consider 
their strong and which their weak traits, but not how strong they believe 
themselves to be in comparison with other persons or with some outside norm. 

Q sorts have been employed to study a variety of psychological problems (Bem 
& Funder, 1978; Block, 1961/1978; Kogan & Block, 1991; Ozer, 1993; Rogers & 
Dymond, 1954). In intensive investigations of individual personality, the respon- 
dent is often asked to re-sort the same set of items within different frames of refer- 
ence. For example, the items may be sorted as they apply to oneself and to other 
persons, such as one's father, mother, husband, or wife. Or they may be sorted as 
they apply to oneself in different settings, such as job, home, or social situations. 
Q sorts can likewise be obtained for individuals as they believe they actually are 


14 ibli f 
s DER bibliography of the ACL through 1980 is available from the test publisher (Gough & Heilbrun, 
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(real self), as they believe others see them (social self), and as they would like to 
be (ideal self). To observe change, Q sorts may be obtained successively at differ- 
ent stages during psychotherapy, a procedure that has been followed especially by 
client-centered therapists. With therapy, the self-concept tends to become more 
favorable and to resemble more closely the individual's ideal self-concept (Rogers 
& Dymond, 1954, chap. AUS 


The Semantic Differential. This technique was first developed by Osgood 
and his associates (Osgood, Suci, & Tannenbaum, 1957) as a tool for research on 
the psychology of meaning, although its possibilities for personality assessment were 
soon recognized. The Semantic Differential represents a standardized and quantified 
procedure for measuring the connotations of any given concept for the individual. 
Each concept is rated on a 7-point graphic scale as being more closely related to one 
or the other of a pair of opposites, as illustrated in Figure 16-3. For every concept, a 
series of these bipolar adjectival scales is employed; usually 15 or more scales are 
included. Intercorrelations and factorial analyses of the original set of 50 scales 
developed by Osgood revealed three major factors; Evaluative, with high loadings 
in such scales as good-bad, valuable-worthless, and clean-dirty; Potency, found in 
such scales as strong-weak, large-small, and heavy-light; and Activity, identified 
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Figure 16-2. Illustration of the Semantic Differential 
Technique. In rating the concept “Father,” the respondent 
places a check mark on the appropriate segment of each scale. 
Usually a much larger number of scales is employed. 


(SS 
!5AII these procedures can, of course, be followed also with the previously described Adjective Check List or 
Q sorts and checklists can be used to report 


with any other techniques for assessing self-concepts. In addition, 
observers’ evaluations as discussed later in this chapter. 
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in such scales as active-passive, fast-slow, and sharp-dull. The evaluative factor is 
the most conspicuous, accounting for the largest percentage of total variance. 
Responses on the Semantic Differential’can be analyzed in several ways. For 
quantitative treatment, the ratings on each scale can be assigned numerical values 
from 1 to 7, or from —3 to 43. The overall similarity of any two concepts for an indi- 
vidual or a group can then be measured in terms of their positions on all scales. The 
connotations of all concepts rated by a single individual can be investigated by com- 
puting the “score” of each concept in the three principal factors described above. 
Thus, on a scale extending from —3 to +3, a given individual's concept of “My 
Brother” may rate —2 in the evaluative factor, 0.1 in potency, and 2.7 in activity. 
The concepts to be rated can be chosen to fit whatever problem is being inves- 
tigated. Respondents can, for example, be asked to rate themselves, members of 
their family, friends, employers, teachers, or public figures; members of different 
ethnic or cultural groups; persons engaged in different occupations; activities, 
such as studying or outdoor sports; abstract ideas, such as confusion, hatred, sick- 
ness, peace, or love; product names or brand names; and radio or television pro- 
grams. The Semantic Differential has been applied in many different contexts, in 
research on such varied problems as clinical diagnosis and therapy, vocational 
choices, cultural differences, and consumers’ reactions to products and brand 
names (Snider & Osgood, 1969). Moreover, further refinements of the technique 
itself continue to be made as well (see, e.g., Cogliser & Schriesheim, 1994). The 
bibliography on the Semantic Differential includes more than 2,000 references. 


Role Construct Repertory Test. A technique devised specifically as an aid 
in clinical practice is the Role Construct Repertory Test (Rep Test) developed by 
G. A. Kelly (1955, 1963, 1970). The development of the Rep Test is intimately 
related to Kelly’s personality theory. A basic proposition in this theory is that the 
concepts or constructs an individual uses to perceive objects or events influence her 
or his behavior. In the course of psychotherapy, it is frequently necessary to build 
new constructs and to discard some old constructs before progress can be made. 
The Rep Test was designed to help the clinician identify some of the client’s 
important constructs about people. Although the test can be administered in 
many ways, including both group and individual versions, it always involves the 
sorting of stimuli, in one way or another. The Rep Test yields data that can be 
arranged in a matrix, or grid, and that allow the assessment of relationships 
among constructs. One of its simpler variants will serve to illustrate its essential 
characteristics.!° In this variant, the respondent is first given a Role Title List and 
asked to name a person in her or his experience who fits each role title. Typical 
roles might include: your father, your wife or present girlfriend, a teacher you 
liked, and a person with whom you have been closely associated recently who ap- 
pears to dislike you. The examiner next selects three of the persons named and 
asks, “In what important way are two of them alike but different from the third?” 


16For descriptions of other procedural variants, see Bannister and Mair (1968), Beail (1985), Landfield and Ept- 
ing (1987), G. J. Neimeyer (1989), and Winter (1992). 
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This procedure is repeated with many other sets of three names, in which some of 
the names recur in different combinations. Once the raw data of relevant con- 
structs have been generated in this fashion and organized in grid form, their im- 
plications can be analyzed. Although the Rep Test yields a wealth of qualitative 
data, many quantitative procedures also have been developed to establish how 
important the elicited constructs are to the individual. These procedures range 
from simple descriptive statistics to highly complex structural analyses. Software 
programs are available to aid in both the content and structural analyses of proto- 
cols (see, e.g., Winter, 1992). 

The Rep Test, in its various modifications, has been used in a considerable 
amount of research on problems relating to; among other things, personality the- 
ory, social cognition, education, and communication, as well as to psychotherapy 
and assessment. Among the indices derived from the respondent's classifications 
of familiar persons is one designated as cognitive complexity. This index is based on 
the number of different constructs employed by an individual and is regarded as a 
measure of cognitive style. A higher degree of cognitive complexity means that 
the individual employs more dimensions, and hence a more differentiated cogni- 
tive system, in organizing or representing the environment (Bieri, 1971; Bieri et 
al., 1966; Goldstein & Blackman, 1978a, pp. 483-487; 1978b). However, cogni- 
tive complexity itself is a multifaceted construct, and its degree of correlation 
with other variables depends to a large extent on how it is conceptualized and 
measured (see, e.g., Goldsmith & Nugent, 1984). 

After a period of relative dormancy, the 1980s witnessed a resurgence of inter- 
est in Kelly’s personal construct theory, and research within its framework has 
continued at an increasing pace since then (Bannister, 1985; Burr & Butt, 1992; 
Epting & Landfield, 1985; Fransella & Thomas, 1988; G. J. Neimeyer & R. A. 
Neimeyer, 1990; R. A. Neimeyer & G. J. Neimeyer, 1992). Kelly’s theory also has 
been used as one of the pillars of constructivism.!’ Because it is ideally suited to 
investigating the unique ways in which the individual derives and organizes 
meaning, the Rep Test itself has undergone extensive use in recent years. Investi- 
gators are now applying more sophisticated statistical techniques, such as hierar- 
chical cluster analysis and multidimensional scaling, in analyzing repertory grid 
data (see, e.g., Merluzzi, 1991; Ogilvie & Ashmore, 1991). The technique, how- 
ever, is subject to so many variations that generalizations about its effectiveness 


and psychometric properties cannot be made. 


Perceived Environment and Social Climate. The concept of the environ- 
ment enters into psychology in many ways.!8 We have already considered the 


ee 
'7Constructivism is a lively and rather controversial movement within psychology that sees human beings as 
proactive agents whose major endeavor is to make meaning out of experience and calls for multiple perspectives 
on knowledge—beyond positivism (Gergen, 1985; Mahoney, 1991; G. J. Neimeyer, 1993; R. A. Neimeyer & 
Mahoney, 1995). 

18n fact, after three decades of activity, the study of 
flourishing not only in the United States but also around the world (see, 
Groat, 1995; McAndrew, 1993; Stokols, 1995; Stokols & Altman, 1987). 


‘environment and behavior is a multidisciplinary field that is 
eg, Bonnes & Secchiaroli, 1995; 
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significant role that observers’ assessment of the environment can play in the 
context of testing special populations (chap. 9), as well as the impact that cultural 
and situational variables have on the expression of cognitive and affective traits 
(chaps. 9, 11, 12, & 13). From a phenomenological perspective, the assessment of 
environments and social climates through the reported perceptions of the persons 
within each milieu can also contribute significantly to the understanding of indi- 
viduals and groups. 

A number of measures have been devised to describe and evaluate physical and 
social aspects of the environment (for reviews of several of these, see Walsh & 
Betz, 1995, chap. 11). The ten Social Climate Scales, developed by Rudolph H. 
Moos (1974, 1993a, 1993b, 1993c, 1994a; Moos & Spinrad, 1984) at Stanford 
University, are among the most frequently used and versatile instruments of this 
type. The Social Climate Scales are applicable in the following contexts: hospital- 
based and community psychiatric treatment programs, sheltered care and correc- 
tional facilities, military settings, university student residences, high school 
classrooms, work milieus, and families; there is also a more general group environ- 
ment scale for task-oriented, social, and mutual support groups. Most of the scales 
consist of 90 or 100 true-false items with which the respondent describes her or his 
perception of the given environment. Item "ere originally prepared to tap theo- 
retically chosen environmental dimensions, —— has press toward involvement, au- 
tonomy, or order; final items were empirically selected on the basis of their ability 
to discriminate among contexts, as well as their internal consistency within sub- 
scales. Each of the ten scales has been revised and updated one or more times 
(Moos, 1994b; Moos & Moos, 1994; Trickett & Moos, 1995). The Social Climate 
Scales can be administered in three different forms which measure (a) actual per- 
ceptions of the real environment, (b) perceptions of an ideal environment, and (c) 
expectations about an unknown environment of a certain type. 

Each environmental context is described in terms of 7 to 10 subscale scores 
that assess various dimensions of the environment. Several of these dimensions 
recur in the scales for different contexts. In fact, despite the wide diversity of 
contexts covered by the Social Climate Scales, the subscales in each context all 
fit into the same tripartite classification, including (1) Relationship Dimen- 
sions (e.g., Involvement, Support, Peer Cohesion); (2) Personal Growth Di- 
mensions (e.g., Autonomy, Task Orientation, Competition); and (3) System 
Maintenance and Change Dimensions (e.g., Order & Organization, Clarity, In- 
novation). It is also noteworthy that the Social Climate Scales were designed 
for relatively small units within complex and heterogeneous institutions—for 
example, a classroom rather than a whole school, a treatment program rather 
than a whole hospital, a university residence center rather than a whole univer- 
sity. In this respect, these scales yield more readily interpretable and less am- 
biguous data than would be obtained from a composite assessment of an entire 
organization. Although the User’s Guide for the Social Climate Scales (Moos, 
19942) indicates that they may be applied in clinical assessment, organizational 
consulting, and program evaluation, several reviewers believe that the scales are 
more suitable for research on the determinants and consequences of perceptions 
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about the environment (Allison, 1995; Loyd, 1995; R. O. Mueller, 1995; 
Saudargas, 1989; Sheehan, 1995; C. R. Smith, 1989). 


OBSERVER REPORTS 


The tests considered thus far give ample evidence of the variety of approaches 
that have been followed in the assessment of personality. Yet the best that can be 
said about most of them is that they are promising experimental techniques suit- 
able for research purposes or useful instruments when interpreted by a skilled clin- 
ician in conjunction with other information about the person. It is apparent that 
for the assessment of personality one cannot rely entirely on standardized tests. 
Other sources of information are needed to follow up or supplement the leads pro- 
vided by test scores, to assess traits for which no adequate tests are available, and 
to obtain criterion data for developing and validating personality tests. 

Direct observations of behavior play an essential part in personality appraisal, 

whether in the clinic, counseling center, classroom, personnel office, or any other 
context calling for individual evaluations. To place such behavioral observations 
in the proper perspective, we must remember that all tests are themselves evalua- 
tions of small samples of behavior. To be sure, these behavior samples are ob- 
tained and evaluated under standardized conditions. But against the obvious 
advantages of such standardized procedures, we must balance the advantages of a 
much more extensive sampling of behavior available through observational tech- 
niques in natural settings. To take an extreme example, if we had a detailed biog- 
raphy of a person extending from birth to age 30, we could probably predict her or 
his subsequent behavior more accurately than could be done with any test or bat- 
tery of tests. Such a record of all the minutiae and circumstances of a person's life 
would be hard to come by; but if we had it, we could make predictions from a 30- 
year behavior sample rather than from the one- or two-hour samples provided by 
tests. 
In all the techniques to be considered in this section, what the individual does 
in natural contexts over relatively long periods of time is transmitted through the 
medium of one or more observers. Much can be done to improve the accuracy and 
communicability of such observations. 


Naturalistic Observation. Techniques for the direct observation of sponta- 
neous behavior in natural settings have been employed most widely by child 
psychologists, particularly with preschool children. Although such procedures 
can be followed with persons of any age, the younger the individual the less 
likely it is that the behavior of interest will be affected by the presence of the 
observer or that the individual will have developed the social facades that 
complicate the interpretation of behavior. These observational techniques have 
also proved useful in the classroom, especially if the observer is the teacher or 
someone else who fits readily into the normal school setting. A major application 
of such assessment techniques is to be found in behavior modification programs 


478 PERSONALITY TESTING 


conducted in schools, homes, child care centers, clinics, hospitals, or any other 
context (Hartmann & Wood, 1990; Kent & Foster, 1977; Lalli & Goh, 1993). 
Several ingenious applications of naturalistic observation have also been devised 
for social psychology (Webb, Campbell, Schwartz, Sechrest, & Grove, 1981) and 
cross-cultural research (Bochner, 1986). 
Naturalistic observation covers a wide variety of procedures (Adler & Adler, 
1994; Jones, Reid, & Patterson, 1975; Sattler, 1988, chap. 17). They range from 
comprehensive long-term techniques, as illustrated by the diary method, to more 
narrowly circumscribed, shorter, and more highly controlled observations, as illus- 
trated by time sampling. The latter comprises a representative distribution of short 
observation periods. Depending on the nature and purpose of the observations, such 
periods may vary in length from less than a minute to several hours; periods of five 
minutes or less are the most common. The observations may be concentrated in one 
day or spaced over several months. They may cover all behavior occurring during 
the specified period; but more often they are limited to a particular kind of behavior, 
such as language, locomotion, interpersonal behavior, or aggression. Checklists of 
what to look for are a useful observational aid. Other procedural aids include obser- 
vational schedules, record forms, coding systems, and mechanical recording devices 
(W. W. Tryon, 1985, chaps. 7 & 8). When practicable, recordings can be made on 
audiotape, film, or videotape.'? Portable microcomputer systems that allow en- 
hancements in the collection and analysis of observation data are available as well 
(Kratochwill, Doll, & Dickson, 1991, pp. 137-141; Repp & Felce, 1990). 

It might be noted that naturalistic observations have much in common with the 
previously discussed situational tests. They differ principally in two respects: In 
naturalistic observations, no control is exerted over the stimulus situation, and— 
at least in most observational methods—a more extensive behavior sample is ob- 
served. Interest in research on naturalistic observation techniques is on the rise, es- 
pecially in reference to their ecological validity and their usefulness in assessing 
changes over time (see, e.g., Barkley, 1991; Kaminer, Feinstein, & Seifer, 1995). 


The Interview. Mention should also be made of the time-honored source of 
information provided by interviewing techniques. Interviewing serves many 
purposes in clinical psychology, counseling, personnel psychology, and education. 
Discussions of the methods, applications, and effectiveness of interviewing, and 
of research on the interviewing process, can be found in many sources.2° In form, 


‘An unusually comprehensive example of an assessment system based largely on observational data is the one 
devised by Gottman (1994, 1996) for the analysis of marital relationship processes. The system, which includes 
elements reminiscent of the assessment-center methodology, was developed and tested through a long-term pro- 
gram of research aimed at isolating the factors that predict the eventual outcome of a marriage. 


20The extensive literature on interviewing includes periodic surveys of relevant research (Eder, Kacmar, & Fer- 
ris, 1989; Graves, 1995; Groth-Marnat, 1990, chap. 3; Landy et al., 1994; McDaniel, Whetzel, Schmidt, & Mau- 
rer, 1994), as well as interviewing guides and recommendations for improving interviewing. procedures, 
especially in clinical practice (Bierman, 1990; Lukas, 1993; Morrison, 1995; Rogers, 1995; Shea, 1988) and in 
personnel selection (Fear & Chiron, 1990; Webster, 1982). A model for teaching and leatning the basic skills 
that are applicable across most interview settings can be found in Gorden (1992). 
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interviews may vary from the highly structured (representing little more than an 
orally administered questionnaire), through patterned or guided interviews 
covering certain predetermined areas, to nondirective and depth interviews in 
which the interviewer merely sets the stage and encourages the interviewee to 
talk as freely as possible. The use of structured interviews for both clinical and 
research purposes in the area of psychiatric diagnosis is now commonplace. These 
instruments ate standardized and typically provide quantitative scores in addition 
to diagnostic classifications; thus, they have to be evaluated by the same 
psychometric standards of reliability and validity that apply to all tests. For 
reviews of various structured interview protocols, see Hodges and Zeman (1993), 
Kamphaus and Frick (1996, chap. 12), and Rogers (1995). 

Interviews provide chiefly two kinds of information. First, they afford an op- 
portunity for direct observation of a rather limited sample of behavior mani- 
fested during the interview situation itself. For example, the individual’s speech, 
language usage, poise, and manner in meeting a stranger can be noted. A sec- 
ond—and perhaps more important—function of interviewing, however, is to 
elicit life-history data. What the individual has done in the past is a good indica- 
tor of what he or she may do in the future, especially when interpreted in light of 
concomitant circumstances and of the person's own comments regarding her or 
his actions. The interview should concern itself not only with what has hap- 
pened to the individual but also with her or his perceptions of these events and 
current evaluations of them. 

On the interviewer's part, the interview requires skill in gathering and in inter- 
preting data. An interview may lead to wrong decisions because important infor- 
mation was not elicited or because given data were inadequately or incorrectly 
interpreted. A critical qualification of the successful interviewer is sensitivity in 
identifying clues in the interviewee's behavior or in facts he or she reports. Such 
clues then lead to further probing for other facts that may either support or con- 
tradict the original hypothesis. 


Ratings. Although ratings may be obtained in many contexts and for diverse 
purposes, the present section is concerned with the use of ratings as an evaluation 
of the individual by the rater on the basis of cumulative, uncontrolled observations 
of daily life. Such ratings differ from naturalistic observations in that the data are 
accumulated casually and” informally; they also involve interpretation and 
le recording of observations. In contrast to both 
ly cover a longer 
and the information is obtained under more realistic condi- 


tions. Ratings are used extensively in assessing individuals in educational and 
and for many 


industrial settings, in obtaining criterion data for test validation, 
research purposes. Since the 1970s, there has been a significant upsurge of research 


on rating techniques, with emphasis on comprehensive, systematic investigations 


and sufficient standardization of definitions and procedures to facilitate compar- 
ison of findings across studies (Borman, 1991; Landy & Farr, 1983; Ozer & Reise, 


1994, pp. 370-371; Saal, Downey, & Lahey, 1980; Sulsky & Balzer, 1988). 
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Much can be done to improve the accuracy of ratings. A common difficulty 
arises from ambiguity in either trait names, scale units, or both. To meet this prob- 
lem, each trait should be defined in specific terms, and the ratings should be ex- 
pressed in a form that will be uniformly interpreted by all raters. Rather than 
using numbers or general descriptive adjectives that convey different meanings to 
different raters, degrees of a trait may be more clearly identified in terms of care- 
fully formulated behavioral anchors (Dickinson & Zellinger, 1980). There is also 
evidence that the relative accuracy of different scale formats may vary with the 
nature of the job or the performance function to be rated (Borman, 1979; J. M. 
Feldman, 1986). 

One of the conditions that affect the validity of ratings is the extent of the 
rater's relevant contact with the person to be rated (Freeberg, 1969; Landy & Farr, 
1980; Paulhus & Bruce, 1992; Wiggins & Pincus, 1992, pp. 493-496). It is not 
enough to have known the person for a long time; the rater should have had an 
opportunity to observe her or him in situations in which the behavior in question 
could be manifested. For example, if an employee has never had an opportunity to 
make decisions on the job, her or his ability to do so cannot be evaluated by the 
supervisor. In many rating situations, it is desirable to include a space to be 
checked in lieu of a rating if the rater has had no opportunity to observe the par- 
ticular trait in a given individual. 

Ratings are subject to a number of errors.”! A well-known example is the halo 
effect. Traditionally, this phenomenon has been defined as a tendency on the part 
of raters to be unduly influenced by a single favorable or unfavorable trait which 
colors their judgment of the individual's other traits. The presence of a halo effect 
is usually inferred from the observed intercorrelations among ratings assigned 
across different performance dimensions. Although the halo effect is still viewed 
in the traditional way—and as a unitary construct—by some investigators, sev- 
eral other possible ways of conceptualizing it and of defining it operationally have 
come to the fore (Balzer & Sulsky, 1992; Kozlowski, Kirsch, & Chao, 1986; Mur- 
phy & Anhalt, 1992; Nathan, 1986).”? 

Another error that can affect ratings data is the error of central tendency, or the 
tendency to place persons in the middle of the scale and to avoid extreme posi- 
tions. Still another is the leniency error, referring to the reluctance of many raters 
to assign unfavorable ratings. The former causes a bunching of ratings in the cen- 
ter of the scale, the latter at the upper end. Both errors reduce the effective width 
of the scale and make ratings less discriminative. One way to eliminate these er- 
rors is to employ ranking or other order-of-merit procedures that force discrimina- 
tion among individuals and hence maximize the information yielded by the 
ratings. Naturally, though, techniques that are applicable when comparisons are 
made within a single group do not permit direct comparisons across groups evalu- 
ated by different raters. 


21Much of the research on rating errors is reported in the Journal of Applied Psychology. 


22Some of these other conceptualizations view halo error, at least partly, as a function of the ratee, and even of 
the specific context of the evaluation, as well as of the rater. 
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An unexpected finding of the cumulative research on rating errors has been 
that the relationship between measures of such errors as halo or leniency and 
other more direct indicators of the accuracy of ratings is far from simple and is 
often counterintuitive. Several investigators who have conducted meta-analytic 
research and reviewed the literature on rating errors have concluded that the 
conceptual and methodological problems inherent in appraisal of performance 
through ratings preclude the use of measures of such errors as a gauge of the accu- 
racy of ratings (Balzer & Sulsky, 1992; Borman, 1991; Murphy & Anhalt, 1992; 
Murphy & Balzer, 1989). 

Nevertheless, the rating process can usually be improved by training the raters. 
Research in various settings has demonstrated the effectiveness of training in in- 
creasing reliability and validity of ratings and in reducing common judgment 
errors (Bernardin & Buckley, 1981; McIntyre, Smith, & Hassett, 1984; Pulakos, 
1986; Stamoulis & Hauenstein, 1993; Sulsky & Day, 1992, 1994). It should be 
noted, however, that many different types of training have been includedin rater 
training programs, and their effects vary in kind, amount, and duration. Training 
may involve providing raters with a uniform reference standard by which to eval- 
uate performance, analyzing common rating errors and ways of minimizing their 
influence, or improving observational skills. Any one or some combination of 
these and other types of training could be most appropriate for particular rating 
contexts and purposes. In most situations, however, enhancing the raters’ obser- 
vational skills is likely to yield favorable results. 

Clinical assessment often requires the collection of data from informants who 
are acquainted with the behavior patterns of the individual who is being assessed. 
Rating scales provide an efficient way of gathering such data and are particularly 
useful in the assessment of children and adolescents. A number of standardized 
scales for obtaining ratings from parents and rend have been published in re- 
cent years and are now commercially available.” One particularly comprehen- 
sive example is the Behavior ‘Assessment System for Children (BASC—Reynolds 
& Kamphaus; 1992). This system includes components for self-report and ob- 
server data in addition to the Teacher Rating Scales and Parent Rating Scales 
which are available in three forms, ranging from the preschool to the adolescent 
age levels (for a review, see R. B. Kline, 1994). 


Nominating Technique. A rating procedure that is especially useful in 
obtaining peer assessments is the nominating technique. Originally developed 
in sociometry (J. L- Moreno, 1953) for investigating group structure, this 

within any group of persons who have been together long 


technique may be used s d 
enough to be acquainted with one another, as in a class, factory, club, or military 
bers with whom 


unit. Each individual is asked to choose one or more group mem 
he or she would like to study, work, eat lunch, play, or carry out any other desig- 
nated function. Respondents may be asked to nominate as many group members 


medium us 
23Discussions of this type of instrument and reviews of several of them can be found in Kamphaus and Frick 


(1996), Piacentini (1993), and Witt, Heffer, and Pfeiffer (1990). 
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as they wish, or a specified number (such as first, second, and third choice), or 
only one person for each function. 

When used for individual assessment, the nominations received by any one 
person can serve to identify potential leaders (who receive many choices) as well 
as isolates (who are rarely or never mentioned). In addition; several indices can 
be computed for a more precise assessment of each individual. The simplest is a 
count of the number of times an individual was nominated for a specific function; 
which can be treated as her or his peer rating. The nominating technique may be 
employed with reference to any behavior of interest. For example, respondents 
can be asked to name the person who has the most original ideas, who can be 
counted on to get the job done, or who is the best sport. They may be asked to 
designate not only the person who is most like the given description but also the 
one who is least like it. In that case, positive nominations would be weighted +1 
and negative nominations —I in totaling each person's score.?^ It should be added 
that peer assessments can also be obtained through other procedures such as rank- 
ing or rating; but the nominating technique seems to have proved most successful 
and has been used most often. 

However obtained, peer assessments have generally emerged as one of the most 
dependable of rating techniques in such diverse groups as military personnel, indus- 
trial supervisors, Peace Corps volunteers, schoolchildren, and college students (Cole 
& White, 1993; Gresham & Little, 1993; Hughes, 1990; Kamphaus & Frick, 1996, 
chap. 10; Kane & Lawler, 1978; J. S. Wiggins, 1973/1988, pp. 356-363). When 
checked against a variety of practical critería dependent on interpersonal relations, 
such ratings usually have been found to have good concurrent and predictive valid- 
ity. These findings are understandable when we consider some of the features of peer 
ratings. First, the number of raters is generally large, including all group members. 
Second, an individual's peers are often in a particularly favorable position to observe 
her or his typical behavior. They may thus be better judges of certain interpersonal 
traits than teachers, supervisors, and other outside observers. Third, and probably 
most important, is the fact that the opinions of group members—right or wrong— 
influence their actions and hence partly determine the nature of the individual's 
subsequent interactions with the group. Other comparable groups may be expected 
to react toward the individual in a similar fashion. Sociometric ratings may thus be 
said to have content validity in the same sense as work samples. 


Checklists and Q Sorts. Any self-report instrument, such as the personality 
and interest inventories discussed in chapters 13 and 14, may also be employed by 
an observer in describing another person.’? Measures designed to assess self- 


24When negative nominations are obtained, examiners should be particularly careful to prevent any potentially 
deleterious effects of the procedure. For a discussion of the ethics of peer-referenced assessment with children 
and adolescents, see Gresham and Little (1993, pp. 174-175) and Kamphaus and Frick (1996, pp. 201-203). 

5The Personality Inventory for Children (PIC) and the Personality Inventory for Youth (PIY)—described in 


chapter 13—are, in fact, parallel instruments in (a) self-report form (PIY) and (b) in the form of an inventory of 
observed behavior (PIC). 
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concept are especially well suited to this purpose. The Adjective Check List 
(ACL) has been used extensively to obtain observers’ evaluations in the IPAR 
research program (Gough & Heilbrun, 1983). Trained psychologists who have 
observed the participant closely over a two- or three-day assessment period record 
their evaluations by checking the appropriate adjectives on the list. 

The Q sort has also been widely used for observer evaluations. Block 
(1961/1978) originally developed the California Q-Sort Deck to provide a stan- 
dard language for comprehensive personality evaluations by professionally trained 
observers. The deck was subsequently published for more general distribution and 
revised to make its language appropriate for lay users as well as professionals. An 
adaptation for use at younger ages, the California Child Q-Set, is also available 
(for review, see Heilbrun, 1985). For all forms, the materials consist of 100 state- 
ments to be sorted into a 9-point forced distribution. The statements are sorted 
with regard to their “salience” for the individual—that is, their importance in 
specifying the unique and essential characteristics of the person who is being 
evaluated. Thus, the ipsative frame of reference typical of Q sorts is retained; the 
individual is not compared with outside normative standards. 

The availability of such uniform sets of Q-sort items facilitates communication 
and assures comparability of data from different observers. The standard O sort 
can also be used for a number of other research purposes (see, e.g., Caspi et al., 
1992; Reise & Oliver, 1994; Wink, 1992). Another application involves the use 
of this technique in individual assessments. In this connection, Block 
(1961/1978) provides examples of three “defining Q sorts,” representing a con- 
sensus evaluation of an optimally normal individual and two psychiatric syn- 
dromes, with which a given individual’s Q sort can be compared. Similar defining 
Q sorts may be developed for any desired category of persons. 


BIODATA 


In the discussion of interviewing techniques earlier in this chapter, reference was 
made to the importance of life-history data. Information about a person’s past be- 
havior and experiences is of interest both to theoreticians trying to understand 
patterns of personality and cognitive development and to applied psychologists 
trying to assess individuals and predict behavior. This is so primarily because the 
way an individual has responded in particular situations in the past is a promising 
source of information on how that individual will respond to similar situations in 
the future. l 
Life-history data can be obtained through several methods, of which inter- 
viewing and questionnaires are probably the most ubiquitious, especially in clini- 
cal and counseling psychology. Diaries and autobiographical writings also provide 
a rich source of information for psychobiographers and others interested in the 
study of individual lives (see, e.g. J. S. Wiggins & Pincus, 1992, pp. 487-493). 


26The use of autobiographical memories as a projective tool is discussed in chapter 15. 
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And investigators who conduct longitudinal studies not only collect but also cre- 
ate records of life-history data through repeated observation and measurement of 
their subjects over time (Funder, Parke, Tomlinson-Keasey, & Widaman, 1993). 

The most structured method of gathering and using life-history data, however, 
is through the biographical inventories or scales—now collectively referred to as 
biodata measures—devised for the prediction of performance in. industrial and 
educational settings. Like the personality and interest inventories surveyed in 
chapters 13 and 14, the biographical inventory is a standardized self-report instru- 
ment in which responses to items are selected from among two or more options 
rather than constructed by the respondent. Although most of their questions gen- 
erally pertain to relatively objective and readily verifiable facts, when. they. are 
scored and used for assessment or prediction, biographical scales—like structured 
interviews—must meet the same psychometric standards of reliability and valid- 
ity as any other test. Typical iteras deal with amount and nature of education, job 
experiences, special skills, hobbies, and recreational activities. Frequently the in- 
dividual’s reaction to prior experiences is sought, as when respondents are asked 
about the courses they liked best and least in school, or what they liked or disliked 
in their job experiences.” 

Historically, biographical scale items have been selected and weighted by cri- 
terion keying, as in the construction of such inventories as the MMPI and the 
Strong, discussed in chapters 13 and 14. The resulting inventory is then cross- 
validated against the same criterion in a new sample. When these procedures are 
followed, biodata inventories have proved to be consistently good predictors of 
performance in a wide variety of contexts. Inventories have been developed 
against such varied criteria as amount of insurance sold by life insurance agents, 
job turnover of bank clerks, productivity of research scientists, artistic creativity 
of high school students, and performance of naval personnel in diver training. 
Such measures have proved valid as predictors of job performance in groups 
ranging from unskilled and blue-collar workers to professionals and high-level 
executives (Anastasi, 1979, pp. 79-80; Owens, 1983). At the same time, a limi- 
tation of many biographical inventories developed for specific jobs by empirical 
methods has been that they are not transportable or generalizable. Their applic- 
ability tends to be limited to the particular settings and criteria employed in 
their development (Baird, 1985; Hunter & Hunter, 1984). 

Biographical measures continue to be the subject of extensive research. There 
has been considerable exploration of rational and factor analytic approaches to the 
development of biodata scales (Hough & Paullin, 1994; Schoenfeldt & Mendoza, 
1994). In contrast to the empirical selection and criterion-keying of items, the ra- 
tional approach usually begins with the identification of major relevant constructs 
through job analyses and surveys of published sources in both the empirical and 
theoretical literature. This is followed by factor analyses of the preliminary item 
pool, from which final items are chosen for the resulting factor scales. The relative 


77A thorough presentation of the domain and attributes of biodata items can be found in Mael (1991). 
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merits of the various approaches to the construction of biodata scales are a subject 
of continued debate, although it is clear that each approach has strengths as well as 
weaknesses. The ideal approach is a construct validation strategy that incorporates 
as many varied elements as possible (Hough & Paullin, 1994). 

In addition to the traditional approaches to inventory development, new 
methods of generating, selecting, and keying biodata items are being tried in an 
effort to make the resulting instruments more generalizable and transportable 
(see, e.g., Rothstein, Schmidt, Erwin, Owens, & Sparks, 1990; Russell, Mattson, 
Devlin, & Atwater, 1990). William A. Owens pioneered one of the most produc- 
tive avenues of research which uses clustering techniques to identify subgroups of 
individuals who share modal patterns of life-history experiences (Mumford, 
Stokes, & Owens, 1990; Mumford & Stokes, 1992; Owens & Schoenfeldt, 1919). 
Guch findings can then be applied to the prediction of multiple criteria. More im- 
portantly, the study of subgroups identified through these methods can lead to a 
level of understanding of the developmental patterns in the lives of individuals 
that combines elements of the idiographic and nomothetic approaches (Hein & 
Wesley, 1994). 

Today, after years of carefully controlled research, biodata measures are among 
the most dependable and effective means of assessment and selection in education, 
industry, government, and other contexts. Furthermore, the history of biodata re- 
search highlights, in an unusually clear fashion, the mutual interdependence of 
basic and applied science. To wit, the methods and results of studies conducted to 
solve pragmatic problems of personnel selection are contributing to the formula- 
tion of a theoretical framework for understanding the development of behavior 
patterns across a lifetime. 

Nevertheless, the implementation of biodata measures is not without its prob- 
lems, which are practical and political, as well as technical. Chief among these 
problems are concerns about (a) legal issues, such as invasion of privacy, and 
equal employment opportunity matters (Sharf, 1994), and (b) the susceptibility 
of biodata measures and retrospective reports to faking and other sources of inac- 
curacy (Henry, Moffitt, Caspi, Langley, & Silva, 1994; Lautenschlager, 1994; 
Trent & Laurence, 1993). Much of the available research on all aspects of the de- 
velopment and use of biographical instruments has been summarized in the Bio- 


data Handbook (Stokes, Mumford, & Owens, 1994). 


ajor Contexts of Current 
Test Use 


sychological tests are used for a wide variety of purposes, and their areas of 
application are continually expanding. Having reviewed representative ex- 
amples of various types of tests in some detail, we now turn to a considera- 
tion of issues that pertain to their application. In this chapter, we consider 
three major areas within which tests serve a multiplicity of functions; these areas 
of test use can be broadly designated as the educational, occupational, and clini- 
cal/counseling contexts. In the next, and final, chapter we discuss ethical and so- 
cial issues that relate to testing practices across all areas. 


EDUCATIONAL TESTING 


Nearly every type of available test is used in the schools. Intelligence, special apti- 
tude, multiple aptitude, and personality tests can all be found in the repertoire of 
the educational counselor and the school psychologist. Teachers and educational 
administrators frequently have to act on the results obtained with several different 
kinds of tests. Certain types of tests, however, have been specifically developed for 
use in educational contexts.! It is with such tests that this section is concerned. 
They include instruments designed for prediction and classification within specific 
educational settings and many varieties of educational achievement tests. 


‘Both the 1985 Testing Standards (AERA, APA, NCME) and their proposed revision (see chap. 1) have a chap- 
ter devoted to the use of tests in education. 
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Achievement Tests: Their Nature and Uses. Surpassing all other types of 
standardized tests in sheer numbers, achievement tests are designed to measure the 
effects of a specific program of instruction or training. It has been customary to 
contrast achievement tests with aptitude tests, the latter including general 
intelligence tests, multiple aptitude batteries, and special aptitude tests. From one 
point of view, the difference between achievement and aptitude testing is a difference 
in the degree of uniformity of relevant antecedent experience. Thus, achievement 
tests measure the effects of relatively standardized sets of experiences, such as a course 
in elementary French, trigonometry, or computer programming. In contrast, aptitude 
test performance reflects the cumulative influence of a multiplicity of experiences in 
daily living. We might say that aptitude tests measure the effect of learning under 
relatively uncontrolled and unknown conditions, whereas achievement tests measure 
the effects of learning that occurred under partially known and controlled conditions. 

A second distinction between aptitude and achievement tests pertains to their 
respective uses. Aptitude tests serve to predict subsequent performance. They are 
employed to estimate the extent to which the individual will profit from a speci- 
fied course of training, or to forecast the quality of her or his achievement in a 
new situation. Achievement tests, on the other hand, generally represent a termi- 
nal evaluation of the individual’s status on the completion of training. The em- 
phasis in such tests is on what the individual can do at the time. 

It should be recognized, however, that no distinction between aptitude and 
achievement tests can be applied rigidly. Some aptitude tests may depend on 
fairly specific and uniform prior learning, and some achievement tests cover rela- 
tively broad and unstandardized educational experiences. Similarly, an achieve- 
ment test may be used asa predictor of future learning. As such, it serves the same 
function as an aptitude. test. For example, achievement tests on premedical 
courses can serve as predictors of performance in medical school. 

In an effort to avoid the excess meanings that have become associated with the 
terms “aptitude” and “achievement,” the more neutral term “ability” is being sub- 
stituted more and more to designate measures of cognitive behavior.” Any cogni- 
tive test, regardless of what it has been called traditionally, provides a sample of 
what the individual knows at the time he or she is tested and measures the leve 


of development attained in one or more abilities. No test reveals how or why the 


individual reached that level. To answer the latter questions, one must delve into 
other concomitant variables, and especially into the person's experiential back- 
ground. In. this sense, every test score has a past that needs to be explored for 
proper understanding of the individual who obtained it. The same test score also 
has a future, insofar as it permits some prediction of what that individual will do 


in other, nontest situations, a5 well as at some future time. 


regard to test labels is provided by the new names for the 
College Board examinations, which became effective in 1994. The widely known SAT abbreviation now stands 
for Scholastic Assessment Test, rather than Scholastic Aptitude Test. The new SAT has been regrouped into 
SAT-I: Reasoning Test, which replaces the former Scholastic Aptitude Test, and the 
Achievement Tests. These changes were accompanied by other, 
IL be discussed in a later section of this chapter. 


Ai Ae OR co 
2One distinct example of the changes taking place with 


two components—the 
SAT-II: Subject Tests, which replace the former / 
more substantial innovations in the tests that wi 
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Figure 17-1. Tests of Developed Abilities: Continuum of Experiential Specificity. 


Tests of developed abilities—whether designed as general intelligence tests, 
multiple aptitude batteries, special aptitude tests, or achievement tests—may be 
ordered along a continuum in terms of the specificity of experiential background 
that they presuppose. This continuum is illustrated in Figure 17-1. At one ex- 
treme are the course-oriented achievement tests covering narrowly defined tech- 
nical skills or factual information. A test in Russian vocabulary or television 
maintenance would fall at this end of the continuum. Next come the broadly ori- 
ented achievement tests commonly used today to assess the attainment of major, 
long-term educational goals. Here we find tests focusing on the understanding 
and application of scientific principles, the interpretation of literature, or the ap- 
preciation of art. Still somewhat broader in orientation are tests of basic cognitive 
skills—such as reading comprehension, arithmetic computation, and logical rea- 
soning—that affect the individual's performance in a wide variety of activities. It 
is apparent that here achievement tests fuse imperceptibly with traditional intel- 
ligence and aptitude tests? The predominantly verbal cognitive batteries, tradi- 
tionally designated as intelligence tests, are close to the broadest achievement 
tests. Next come nonlanguage and performance tests, usually requiring no reading 
or writing. At the extreme generality end are the cross-cultural tests designed for 
use with persons of widely varied experiential backgrounds. 

Labeling some instruments as "aptitude tests" and others as "achievement 
tests" has led to certain misuses of test results. A common example is the practice 
of identifying as underachievers those children whose achievement test scores are 
lower than their scholastic aptitude or intelligence test scores. Actually, such 
intraindividual differences in test scores reflect the fact that no two tests (or 
other performance indicators such as course grades) correlate perfectly with each 


?This overlapping can be demonstrated empirically through an examination of the close similarity of content, 
and the degree of correlation, between tests with such labels (see, e.g., W. Coleman & Cureton, 1954; Cooley & 
Lohnes, 1976). 
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other. The question of under- or overachievement can be more accurately formu- 
lated as overprediction or underprediction from the first to the second test. 
Among the reasons for the prediction errors in individual cases are the unreliabil- 
ity of the measuring instruments, differences in content coverage, the varied 
effects of attitudinal and motivational factors on the two measures, and the im- 
pact of such intervening experiences as remedial instruction or a long illness 
(R. L. Thorndike, 1963). 

The many roles that achievement tests can play in the educational process 
have long been recognized. As an aid in the assignment of grades—or in any 
other assessment of achieved competence—standardized achievement tests have 
the advantages of objectivity, uniformity, and efficiency. If properly constructed, 
they have other merits, such as adequacy of content coverage and reduction of 
the operation of irrelevant and chance factors in scoring procedures. Achieve- 
ment tests also constitute an important feature of remedial teaching programs. In 
this connection, they can be useful both in the identification of students with 
special educational disabilities and in the measurement of progress in the course 
of remedial work. 

For all types of learners, the periodic administration of well-constructed and 
properly chosen achievement tests may serve to facilitate learning. Such tests can 
reveal weaknesses in past learning, give direction to subsequent learning, and mo- 
tivate the learner. The incentive value of "knowledge of results" has been repeat- 
edly demonstrated by psychological experiments in many types of learning 
situations, with learners of widely varying age and education. The effectiveness of 
such self-checking is generally heightened by immediacy. 

From another angle, achievement tests provide a means of adapting instruction 
to individual needs. Teaching can be most fruitful when it meets the learner at 
whatever stage he or she happens to be. Ascertaining what individuals are already 
able to do and what they already know about a subject is thus a necessary first step 
for effective teaching. By giving tests at the beginning of the school year, educa- 
tors can take constructive steps to fill the major gaps in knowledge revealed by 
test results. Further examples of the role of achievement tests in the teaching 
process can be found in connection with domain-referenced testing and the indi- 
vidually tailored instructional systems discussed in chapter 3. Á f 

Finally, achievement tests may be employed as aids in the evaluation and im- 


provement of teaching and in the formulation of educational goals. Achievement 


tests can provide information on the adequacy with which essential content and 


skills are actually being taught. By focusing attention on such issues and by Mess 
ing concrete facts, achievement tests stimulate an analysis of seed 
objectives and encourage a critical examination of the content and methods 
of instruction: Since public demands for accountability have risen, in the past few 
decades, there has been an unprecedented amount of testing, at educational 


^ i cati e.g., Penner, Batsche, Knoff, & 
4Recent publications in the areas of mathematics and science education (see, Gë: BREAD 
Nelson, 1993) cn in the enhancement of thinking skills (see; e.g Mulcahy, e zc & Andrews, 1991) exem- 
plif the kind of sound, solution-oriented work that can result from such critica! PRETT 
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institutions of all levels. Much of this testing is mandated by states and local 
school districts, as well as by the federal government (B. Gifford, 1989b; Hartle & 
Battaglia, 1993; National Council on Education Standards and Testing, 1992). 
The National Assessment of Educational Progress, known informally as “the na- 
tion’s report card,” is one of the best known examples of an ongoing government- 
mandated testing program (see, e.g, Alexander & James, 1987; Gentile, 
Martin-Rehrmann, & Kennedy, 1995; E. G. Johnson, 1992; Messick, Beaton, & 
Lord, 1983; NAEP, 1985; E B. Womer, 1970). Because of the pervasiveness of stan- 
dardized tests and the wide-ranging economic consequences that may be linked to 
their results, the tests themselves have been subjected to intense scrutiny and crit- 
icism. The methods used to assess educational progress, once the exclusive 
province of test specialists, have thus become the focus of a highly politicized de- 
bate that has received the attention of legislators and industrial leaders, as well as 
the general public (R. E. Bennett & Ward, 1993; Courts & Mclnerney, 1993; 
Gifford & O'Connor, 1992; S. P. Robinson, 1993; G. P. Wiggins, 1993). For a dis- 
cussion of the issues and trends involved in school testing and assessment programs 
that are mandated by external sources, see Linn and Gronlund (1995, chap. 18). 


Construction versus Choice. Historically, the traditional school exami- 
nations consisted of a set of questions to be answered orally or in writing. In either 
case, the examinee composed and formulated the response. The term "essay 
question" came to be used broadly to cover all free-response questions, including 
not only those demanding a lengthy essay but also those requiring the examinee 
to produce a short answer or to work out the solution for a mathematical problem. 
"Objective questions," by contrast, were those that called for the choice ofa 
correct answer out of the alternatives provided for each question. Although there 
are several kinds of items that require examinees to select a response, such as 
true-false and matching, the multiple-choice question has been, by far, the most 
widely used, the most thoroughly studied, and, also, the most frequently criticized 
type of test item. 

Critics of the multiple-choice format argue that it promotes rote memorization 
and learning of isolated facts rather than development of problem-solving skills 
and conceptual understanding. In addition, many uninformed people within the 
educational and political establishments equate multiple-choice items with stan- 
dardized testing and disparage both of these elements of assessment methodology 
at once.? In an ironic turn of events, the same standardized testing programs used 
to chart educational progress often have been seen as contributing to the educa- 
tional deficits they have uncovered (Courts & McInerney, 1993; H. Gardner, 
1992; Resnick & Resnick, 1992). Unfortunately, the criticisms about excessive 
and inappropriate use of standardized tests have been thoroughly justified in some 
cases. At any rate, charges that testing drives the curriculum and that both are in 


Sir should be noted that the instruments used in many large-scale educational testing programs—such as the 
SAT, the NAEP tests, and other standardized achievement measures—have been using essay questions and 
other open-ended item formats for some time. 
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urgent need of reform have emanated from educators at all levels and have grown 
increasingly stronger in the past two decades. Advocates of educational reform 
believe that a major overhaul is needed in curricular goals and instructional 
methods, as well as in the tools of assessment, and they perceive all of these areas 
as inextricably tied. 

Since the philosophical, political, and practical aspects of educational reform 
are beyond the scope of this book, we shall confine ourselves to discussing some of 
the proposed alternatives in assessment methodology. These alternatives are de- 
scribed by various rubrics, such as “performance-based” assessment, “authentic” 
assessment, and “direct” assessment (see, e.g., E. L. Baker, O'Neil, & Linn, 1993; 
Linn & Gronlund, 1995, chap. 10). Although each of these represents somewhat 
different emphases, one of the central features they all share is a marked prefer- 
ence for tasks that—like the former "essay questions"—require the examinee to 
generate an answer. These items are now described as constructed-response or open- 
ended tasks. They are contrasted with selected-response tasks, which is the term ap- 
plied to items that require the examinee only to choose a response from among 
the options presented, such as multiple-choice items and the other types of ques- 
tions that used to be called “objective.” Constructed-response items may involve 
simple completions (such as “fill-in-the-blanks”), problem solving, and essays, as 
well as a variety of presentations or performances like playing an instrument, giv- 
ing a speech, or repairing a machine. 

The method known as portfolio assessment provides another set of alternatives. 
This type of evaluation tool is aimed primarily at making the process of educa- 
tional assessment as meaningful and realistic as possible. Although there are a 
wide range of procedures to which the term is applied, a portfolio usually consists 
of a cumulative record—collected over an extended period of time—of samples of 
students’ work in specific areas, such as writing or any other endeavor that in- 
volves a process in which progress can be documented (Camp, 1993; Gitomer, 
1993; D. P. Wolf, 1993). The portfolio method of assessment offers a great deal of 
flexibility and can be implemented more or less formally and with various degrees 
of collaboration between the student and the teacher (see Karlsen, 1992, for an 
example of a published instrument of this type). 

The reader will have gathered even from this brief overview that a great deal of 
attention is being paid to the means by which evaluations of learning and of stu- 
dents’ work are conducted. This concern extends not only to what different items 
measure and how well they measure it, but also to other psychological aspects of 
test items. For example, Zeidner (1993) has investigated students’ attitudes to- 
ward item formats and found that they prefer multiple-choice items rather than 
essays. Lu and Suen’s (1995) research indicates that performance-based assess- 
ment tends to favor field-independent over field-deperident students (see chap. 
16). Other investigators have looked into the relationship between test anxiety 


fied along various other dimensions besidies whether they involve construc- 


STest items can, of course, be classi ) 
les of two different taxonomies of item types can be found in R. E. Bennett 


tion or selection of responses. Examp! 
(1993) and R. E. Snow (1993). 
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and item types and found that scores on constructed-response tests seem to be 
more affected by anxiety than those of selected-response tests (Crocker & 
Schmitt, 1987). A discussion of variables that may impinge on the motivation of 
test takers and affect their responsiveness and performance—for example, the 
purpose of an assessment procedure—as well as several other factors that may 
have a bearing on the interpretation of constructed-response and multiple-choice 
tests can be found in R. E. Snow (1993). 

At the same time, the empirical literature concerning the strictly psychometric 
properties of performance-based tasks used in academic settings has been accumu- 
lating gradually.’ Both the pace of the research and the direction of the results 
differ widely depending on the specific types of items in question. A fairly large 
number of studies have investigated the reliability of the scoring procedures used 
for constructed-response tasks, which—like those applied to the situational tests 
discussed in chapter 16—often consist of ratings (E. L. Baker et al., 1993; Linn & 
Gronlund, 1995, chap. 10). Interrater reliabilities are generally quite favorable 
when the scoring rules are clear and well developed and when the raters are prop- 
erly trained. On the other hand, the generalizability of results across topics and 
tasks is typically low, which suggests that constructed-response items have a rela- 
tively high degree of task specificity. This finding is not surprising, considering 
that such items are typically more complex and allow a wider range of responses 
than do selected-response items. 

With regard to the issue of validity, the research base is still rather limited, at least 
for the least restrictive and most innovative types of constructed-response tasks. 
One of the most crucial preliminary questions is the extent to which constructed- 
response and selected-response items measure equivalent traits or skills. Although 
the evidence on this issue is not yet extensive, Traub's (1993) review of studies on 
the subject suggests that the degree of equivalence varies across domains. For in- 
stance, when different item formats are used in tests of reading comprehension or 
mathematical knowledge, they tend to yield equivalent results—whereas, in the 
writing domain, it appears that the type of item employed does have a significant 
effect on scores. : 

Meanwhile, the use of rnultiple-choice items is still pervasive. In fact, the in- 
troduction of different itern formats into educational tests, coupled with the se- 
vere criticism to which multiple-choice items have been subjected, appears to 
have stimulated efforts to improve them. Guidance on how to develop, review, 
and evaluate multiple-choice items is readily available (see, e.g., Haladyna, 
1994), and research on specific aspects of this format, such as the optimum num- 
ber of options to use, continues (Trevisan, Sax, & Michael, 1991, 1994). Further- 
more, a number of variations and refinements of the genre of selected-response 
tasks are also being explored and disseminated (see, e.g., Linn & Gronlund, 1995, 
chap. 8; Sax, 1991; Sireci, Thissen, & Wainer, 1991; Wainer & Kiely, 1987; 
Wainer & Lewis, 1990). 


"See, especially, R. E. Bennett and Ward (1993). 
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Direct comparisons between constructed-response and selected-response items 
have been carried out as well (e.g., Lukhele, Thissen, & Wainer, 1994). Much of 
the time, such comparisons seem to favor multiple-choice items with regard to 
criteria such as economy, efficiency, and predictive validity, especially when they 
are compared with the traditional essay questions (Anastasi, 1988b, pp. 416-418; 
R. E. Bennett, 1993). It must be noted, however, that the problem of evaluating 
and comparing different assessment formats at a time when both the goals and the 
methods of assessment are in a state of flux is anything but straightforward. More- 
over, it is important to keep in mind that item format is only one among many 
variables that interact with one another to determine the fairness, accuracy, and 
overall quality of an assessment procedure. The purpose of the assessment and the 
content domain in question, as well as the characteristics of the individuals who 
are assessed, all need to be considered (E. L. Baker et al., 1993; R. E. Bennett, 
1993; Dwyer, 1993; Mislevy, 1993). For instance, issues such as the differential ef- 
fects of failure on the subsequent motivation of test takers may be placed ahead of 
other criteria for evaluating items, especially for students who have handicapping 
conditions or are otherwise disadvantaged. However, it is worth noting that, at 
present, there is no reason to believe that performance-based assessment results in 
narrowing the gap that has existed between the scores of Whites and those of 
some ethnic minorities on standardized tests of the multiple-choice variety; in 
fact, some research indicates that this type of score disparity may be greater with 
constructed-response than with selected-response tests (Hartle & Battaglia, 
1993), Current issues in the educational assessment of culturally and linguisti- 
cally diverse students are discussed further by Cancelli and Arena (1996), 
K. W. Howell and Rueda (1996), and Shinn and Baker (1996). 


Types of Educational Tests 


In the 1990 Annual Report of the Educational Testing Service, the Board of 
Trustees of that organization predicted that educational testing would change 
mote in the succeeding 10 years than it had in the previous 50 (ETS, 1990). It 


now appears that this prediction was accurate and could be reiterated easily for 


the coming decade. Tests of all kinds are undergoing significant revisions and new 
are being devised at a rapid pace. The following review, there- 
the types of instruments used traditionally in educational con- 

ther than on 


assessment tools 
fore, centers On : 
texts—along with some ongoing innovations within each type—ta 


detailed descriptions of individual tests. 


General Achievement Batteries. Several batteries are available for mea- 
suring general educational achievement in the areas most commonly covered by 
academic curricula. This type of test can be used from the primary grades to the 
adult level, although its major application has been in the elementary school. 
Typically, these batteries provide profiles of scores on individual subtests or in 
major academic areas. An advantage of such batteries, as against independently 
constructed achievement tests, is that they may permit horizontal or vertical 
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comparisons, or both. Thus, an individual's relative standing in different subject- 
matter areas or educational skills can be evaluated in terms of a uniform 
normative sample. Or the student's progress from grade to grade can be reported 
in terms of a single score scale. The test user should check whether a particular 
battery was so standardized as to yield either or both kinds of comparability. 

Although some general achievement batteries are designed exclusively for the 
elementary grades or for high school, most span a broad range extending into 
both levels and occasionally even into the first year of college. A few provide a 
single battery for the grade range covered, but the large majority have several 
overlapping batteries in separate test booklets for use at different levels. Some 
batteries actually form a coordinated series, permitting comparable measurement 
from Grades K through 12. One such combination is the Iowa Tests of Basic 
Skills, the Tests of Achievement and Proficiency, and the Iowa Tests of Educa- 
tional Development; another is the Stanford Achievement Test Series. 

A noteworthy feature of some achievement batteries is that they were concur- 
rently normed with tests of academic intelligence or scholastic abilities. Major 
examples include achievement batteries that were paired with three multilevel 
tests illustrated in chapter 10, namely, the Stanford Achievement Test Series 
with the Otis-Lennon School Ability Test; the Iowa Tests series and Tests of 
Achievement and Proficiency with the Cognitive Abilities Test; and the Califor- 
nia Achievement Tests and Comprehensive Tests of Basic Skills with the Test of 
Cognitive Skills (see Table 10-1). The use of the same standardization sample in 
these cases permits direct comparisons of the scores any student obtains on the 
two types of tests. Usually, the two tests correlate highly, and individuals will ob- 
tain closely similar scotes on them. For students who score substantially higher on 
either type of test, it is desirable to explore possible reasons for the discrepancy. 
The achievement batrery measures largely what the individual has learned in 
basic school courses; the cognitive skills test assesses a broader range of abilities 
and knowledge learned both in and out of school. Any significant performance 
discrepancy could reflect the influence of specific abilities or disabilities, or of 
such noncognitive factors as motivation, interests, and attitudes. The individual's 
experiential background often provides clues to the conditions leading to unusual 
discrepancies in test performance. 

Achievement batteries obviously differ in the technical level of their test con- 
struction procedures. Nevertheless, as a group, they meet high standards of test 
development, especially with regard to size and representativeness of normative 
samples, reliability, and content validation. Detailed test specifications for item 
writing are generally followed by thorough item analyses, including the applica- 
tion. of item response theory methods. Special procedures are commonly em- 
ployed for avoiding gender and ethnic bias. All batteries typically include an 
assessment of basic skills in reading, language, and mathematics, in combination 
with varying amounts of content knowledge in science and social studies. Several 
include subtests designed to measure study skills or ability to use various informa- 
tion sources. Moreover, in response to demands for more flexible and informative 
assessment tools, publishers of the major standardized achievement batteries now 
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offer a greater variety of items and options. Open-ended items and a wider range 
of selected-response items—designed to measure higher order thinking skills and 
set within more meaningful contexts—are currently being used. Publishers pro- 
vide increased flexibility in adapting assessment packages to the requirements of 
local curricula, using various mixes of item content and formats as well as scorin 
options. They also offer more linkages between tests and instructional materials. 


Tests of Minimum Competency in Basic Skills. The past two decades 
witnessed a mounting concern about the low competence level of many high 
school graduates in reading, writing, and arithmetic skills. This concern led to 
popular demands for competency tests in basic skills as a means of certifying the 
attainment of minimum competency and as a basis for awarding a high school 
diploma. Such demands aroused a storm of controversy; most of the objections 
centered on the likelihood of misuses and misinterpretations of minimum- 
competency tests, and on the educational rigidities and bureaucratic controls that 
might ensue.” While most states have established policies regarding minimum- 
competency testing, their policies and procedures vary widely in the time and 
grade level at which the tests are administered, the particular use made of results, 
and the nature and degree of local autonomy in the development or choice of 
tests. Furthermore, tests that are used to make decisions about granting or 
withholding diplomas must reflect the specific curricula they cover. Because of all 
these reasons, appropriate tests are DCH usually developed by schools, school 
systems, or state agencies themselves, often with the assistance of test publishers 
who can provide technically trained personnel, large item banks, and reporting 
services that can be adapted especially to meet local objectives. Such tests may 
include some components of the standardized achievement batteries discussed 
earlier or they may be custom-made for a specific locality. 

In recent years the focus of interest in ascertaining mastery of basic skills has 
extended to the adult population." The cumulative effects of large school 
dropout rates and low levels of competence among high school graduates, as wel 
as the great rise in the number of non-English-speaking immigrants, have fostered 
concerns about the competitiveness of the U.S. labor force in the world market. 
The findings of the National Adult Literacy Survey, conducted in 1992 by ETS 


KEEN REED. 
5An example of this trend is the TerraNova series, recently published by CH B/McGraw-Hill. The components 
); a Multiple Assessments edition, 


of this series include: the new Comprehensive Tests of Basic Skills (CTBS ` 0 
which combines selected-response and constructed-response items; à Performance Assessments edition, which 


provides longer open-ended tasks scored locally or by the publisher; and a Custom Component, which offers 


supplemental items designed to assess specialized curriculum objectives. 

For a comprehensive discussion of the minimum-competency testing movement and the technical psychomet- 
ric problems involved therein, see Berk (1986). Another thorough review of the issues and problems involved in 
the use of tests to certify student competence can be found in Jaeger (1989). 

10The concept of “functional literacy” (Sticht, 1915)—which has been extended to mean competence in the 
use of language in speaking and writing, as well as in understanding and using various documents, and in arith- 
derlies this discussion. Functional competence is defined in terms of the demands of 


metic computation —un 4 6 i E 
practical situations, such as the difficulty level and amount of reading required to perform particular jobs or, 
life in a modern society. 


more broadly, the basic educational skills required to manage one's own 
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under the auspices of the Department of Education, have reinforced this concern. 
The survey revealed that almost one half of the U.S. population placed at the 
lowest two (out of five) possible levels of literacy (Kirsch, Jungeblut, Jenkins, & 
Kolstad, 1993). 

Tests of competency in basic skills designed especially for adults are typically 
developed in connection with adult education classes, educational programs con- 
ducted in penal institutions, and job training programs. An outstanding example 
is the Tests of Adult Basic Education (TABE, Forms 7 & 8, 1994). The TABE bat- 
tery encompasses five graduated levels of difficulty across five different content 
areas that include reading, language, and applied mathematics. The results are re- 
ported as norm-referenced scores and also in terms of competency-based informa- 
tion that can be used diagnostically. In addition to its regular forms, the TABE is 
available in a special version suitable for use in a work environment and in a 
Spanish edition (TABE Español) designed to measure the basic skills of Spanish- 
speaking adults in their primary language. 


Teacher-Made Classroom Tests. Undoubtedly the largest number of tests 
covering the content of specific courses or parts of courses are prepared by 
instructors for use in their own classrooms. The vast diversity among courses on 
the same subject and with identical titles—especially at the high school level and 
beyond—is well known. Under these conditions, no external standardized test 
could suffice. The preparation of local classroom tests, however, can be 
substantially improved by an application of the techniques and accumulated 
experience of professional test constructors. The development of classroom tests 
can be divided into three major steps: (1) planning the test, (2) item writing, and 
(3) item analysis. Some simple techniques of item analysis, suitable for use with 
small groups, were described in chapter 7. A brief overview of the other two steps 
is given in the following paragraphs. +! 

The test constructor who plunges directly into item writing is likely to produce 
a lopsided test. Without an advance plan, some areas will be overrepresented, 
whereas others may remain untouched. It is generally easier to prepare objective 
items on some topics than on others. And it is easier to prepare items that require 
the recall of simple facts than to devise items calling for critical evaluation, inte- 
gration of different facts, or application of principles to new situations. Thus, the 
test constructed without a blueprint is likely to be overloaded with relatively 
impermanent and less important material. Many of the popular criticisms of 
selected-response tests stem from the common overemphasis on rote memory and 
trivial details in poorly constructed tests. 

To guard against these fortuitous imbalances of item coverage, test specifications 
should be drawn up before any items are prepared. For classroom examinations, 
such specifications should be based on outlines of instructional objectives and 
content areas to be covered and should reflect decisions about the relative impor- 


(for further guidance on the preparation of classroom tests and other assessment procedures, see Linn and 
Gronlund (1995, chaps. 5-13). 
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tance of each of these aspects through the number of items allotted to each topic 
and objective. The test constructor must also decide on the most appropriate 
item form for the material. The relative merits of objective and open-ended items, 
discussed in an earlier section of this chapter with special reference to their use in 
large-scale, standardized testing programs, also need to be considered in the con- 
struction of classroom tests. Finally, many practical rules for effective item writing 
have been formulated on the basis of years of experience in preparing items and 
empirical evaluation of responses. Anyone planning to prepare a classroom test 
would do well to consult one of the sources summarizing these suggestions (e.g., 
Ebel, 1979, chaps. 4-9; Haladyna, 1994, chaps. 4-6; Linn & Gronlund, 1995, 
chaps. 6-9; Millman & Green, 1989). 


Tests for the College Level. "A number of tests and testing programs have 
been developed for use in the admission, placement, and counseling of college 
students. One that is outstanding is the Scholastic Assessment Tests (SAT) 
Program of the College Board, presently composed of two elements, namely, the 
SAT I: Reasoning Test, which has replaced the Scholastic Aptitude Test’s Verbal 
and Mathematical sections, and the SAT II: Subject Tests, which have replaced 
the former SAT Achievement Tests.'? The SAT I consists primarily of multiple- 
choice questions measuring verbal and mathematical abilities. It is meant for 
use—as a supplement to high school grades and other information—in the 
assessment of a student's readiness to perform college-level work. The SAT II 
tests, on the other hand, are designed to assess knowledge in specific subject areas 
(e.g., literature, chemistry, and world history ) and can be used for placement as 
well as for admissions. 

The SAT Program tests have changed in content and format, as well as in 
name. For instance, in the Verbal section of the SAT I, there is increased emphasis 
on critical reading and reasoning. Similarly, the Math section now requires students 
to produce, rather than merely to select, some of their answers, and—in keeping 
with current standards—the use of calculators is permitted throughout the Math 
section. The SAT Il tests have also changed and presently include more direct as- 
sessment of skills through the use of listening components and writing samples. 

As of April 1995, the SAT Program test scores are no longer reported in terms 
of a fixed reference group dating back to 1941 (see chap. 3). Instead, the SAT 


score scale has been "recentered;" using a new reference group from the 1990s, so 


that average performance is once again represented by a score of approximately 


500.14 The recentering has made the SAT scores more precise and reliable, espe- 


cially at the highest and lowest ends of the scale. In addition, the interpretation 
of scores is easier; for example, verbal and math scores can now be compared 


Examples of test specification tables can be found in Anastasi (1988b, p. 431) and Linn and Gronlund (1996, 


p. 122). 
13For further historical background on the SAT program, see Anastasi (1988b, pp. 328-331) and Donlon (1984). 
14By the early 1990s, average SAT scores had drifted down from 500 in both areas to 424 in the Verbal area and 


418 in Math. 
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directly with one another without reference to percentile rankings, because both 
have been realigned. Moreover, since a score of 500 is the midpoint of the 200 to 
800 range, the “intuitive” average and the actual average will be the same. The 
College Board has disseminated tables and other tools that simplify the conver- 
sion of original scale scores to recentered scale scores in order to help maintain 
the continuity between the two scales. Information about the reliability, difficulty 
levels, and completion rates of the revised SAT examinations is currently avail- 
able in handbooks prepared especially for counselors and admission officers as 
well as in other ETS and College Board research reports and publications (Col- 
lege Board, 1995a, 1995b).!° A preliminary study comparing the traditional SAT 
with a prototype version of the SAT I found the newer test to be a slightly better 
predictor of freshman GPA than the former (Hale, Bridgeman, Lewis, Pollack, & 
Wang, 1992). Additional evidence on the validity of the revised SAT examina- 
tions will be included in a forthcoming technical supplement that should be 
available in the late 1990s. 

Another nationwide program, launched in 1959, is the American College 
Testing Program (ACT, 1995-1996). Originally limited largely to state university 
systems, this program grew rapidly and is now used in many colleges throughout 
the country. The current ACT Assessment includes four tests: English, Mathe- 
matics, Reading, and Science Reasoning. Reflecting the point of view of its 
founder, E. E. Lindquist, this examination program provides a set of samples of 
college work. It overlaps traditional aptitude and achievement tests, focusing on 
the basic intellectual skills required for satisfactory performance in college. The 
noncognitive components of the ACT Assessment include: a High School 
Course/Grade Information questionnaire, the ACT Interest Inventory, and a Stu- 
dent Profile Section with questions about students' aspirations, plans, accom- 
plishments, and other background information. Historically, the ACT has not 
come up to the technical standards set by the SAT. However, validity data com- 
pare favorably with those found for other instruments in similar settings. 

It should be noted that tests such as the SAT and the ACT are not intended as 
substitutes for high school grades in the prediction of college achievement. High 
school grades can predict college grades as well as or slightly better than most 
tests. When test scores are combined with high school grades, however, the pre- 
diction of college performance is improved. In part, this improvement stems from 
the fact that a uniform, objective test serves as a corrective for the variability in 
grading standards among different high schools. Furthermore, such tests are not 
subject to the possible personal biases or to other arbitrary factors that may enter 
into the assignment of course grades. 

There is also increasing use of specialized achievement tests as college equiva- 
lency examinations. High school students with additional preparation in certain 


'SBecause of their function in the selection of college students, the SAT exams are frequently subject to critical 
scrutiny. Recently, for instance, it was reported that test takers can correctly answer many multiple-choice ques- 
tions based on the SAT reading tasks without referring to the passages that accompany the questions. This find- 
ing has reignited the controversy concerning the extent to which extraneous background knowledge affects 
SAT scores (see, e.g., S. Katz & Lautenschlager, 1995). 
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areas may take tests in the College Board’s Advanced Placement Program (APP) 
to gain admission to college with advanced standing in one or more subjects. A 
related development is to be found in the College Level Examination Program 
(CLEP), also administered by the College Board. The general purpose of this pro- 
gram is to facilitate the granting of college credit by examination and to provide a 
national system for evaluating college-level education acquired through indepen- 
dent study and other nontraditional procedures. A similar test series, the ACT 
Proficiency Examination Program, is administered by the American College Test- 
ing Program. Although it includes some academic subjects, such as Anatomy and 
Physiology and Abnormal Psychology, this program covers mostly occupational 
areas, such as Nursing and Accounting. 


Graduate School Admission: The practice of testing applicants. for 
admission extends to graduate and professional schools. Most of the tests designed 
for this purpose represent a combination of general intelligence and achievement 
tests. A well-known example is the Graduate Record Examinations (GRE). This 
series of tests originated in 1936 in a joint project of the Carnegie Foundation for 
the Advancement of Teaching and the graduate schools of four universities. Now 
greatly expanded, the program is conducted by Educational Testing Service, 
under the general direction of the Graduate Record Examinations Board. 
Students are tested at designated centers—in well over 100 countries around the 
world—prior to their admission to graduate school. The test results are used by 
the universities to aid in. making admission and’ placement decisions and in 
selecting recipients for scholarships, fellowships; and special appointments. The 
GRE include a General Test and Subject Tests in various fields of specialization. 
The current General Test yields separate scores for verbal, quantitative, and 
analytical abilities. The Subject Tests are now available in 16 areas, including 
biology, computer science, French, mathematics, music, political science, and 
psychology. The psychometric characteristics of the GRE are reported in the latest 
edition of the guide to their use (GRE 1995-96 Guide). In general, the Subject 
Test scores are better predictors of first-year graduate grade point average (GPA) 
than either the General Test score composites or undergraduate GPA, but the 
combination of all three measures provides the highest predictive validities. These 
multiple correlations fall between the mid .40s and the low .60s for various fields. 
In October 1992, the GRE Program. started administering a computerized 
version of the traditional form of the General Test, and in November 1993 a 
computer-adaptive General Test was introduced. Despite some initial difficulties 
concerning potential security risks with the computerized GRE General Test, the 
advantages of computerization are such that the GRE Board may eliminate all 
paper-and-pencil testing by as early as 1999. The General Test is also being re- 
vised to include a Writing Test and a Mathematical Reasoning Test, and some 
constructed-response questions (“Update on the New GRE,” 1995). 


at) Do Be 
léprior to 1982, the General Test was called the Aptitude Test and the Subject Tests were called Advanced Tests. 
As with the SAT, the names were changed to avoid potential misunderstanding about the purpose of the tests. 
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Diagnostic and Prognostic Testing. In contrast to the general achievement 
batteries and other achievement tests discussed earlier, the tests discussed in this 
section are designed to analyze a person's particular strengths and weaknesses 
within a subject-matter domain and to suggest causes for her or his difficulties. 
The majority of these diagnostic instruments are individually administered and, 
therefore, most often considered to be clinical tools. Some of them, however, are 
separate components of the major achievement batteries previously discussed and 
are intended for group administration. Most of the published diagnostic group 
tests deal with reading, mathematics, and language skills and offer both normative 
and content-referenced information. Examples of this approach are provided by the 
Stanford Diagnostic Mathematics Test and the Stanford Diagnostic Reading Test, 
and by the California Diagnostic Reading and Mathematics Tests. The publishers of 
these two series have also produced separate instruments for the assessment and 
diagnosis of writing skills. Both the Stanford Writing Assessment Program and the 
CTB Writing Assessment System use direct samples of writing in various modes— 
such as descriptive or narrative writing—and offer several scoring options. 

In connection with the use of all diagnostic tests, one point deserves special 
emphasis. The diagnosis of learning disabilities and the subsequent program of re- 
medial teaching are the proper functions of a trained specialist. No battery of di- 
agnostic tests can suffice for this purpose. The diagnosis and treatment of severe 
learning disabilities require an intensive clinical case study, preferably interdisci- 
plinary, including supplementary information on sensory capacities and motor de- 
velopment, medical and health history, complete educational history, data on 
home and family background, and a thorough investigation of possible emotional 
difficulties. Although survey and group diagnostic tests may serve to identify indi- 
viduals in need of further attention, the diagnosis and treatment of learning dis- 
abilities require specialized techniques. Some of these procedures were discussed 
in chapter 9, and the topic will be considered further in this chapter in connec- 
tion with clinical testing. 

Certain types of tests designed for use in educational contexts are essentially 
prognostic instruments. As such, they function as aptitude rather than as achieve- 
ment tests. At the same time, they frequently resemble achievement tests in con- 
tent, because what they undertake to predict is usually performance in a specific 
course of study. Typical of this approach is the Orleans-Hanna Algebra Prognosis 
Test (Hanna, Sonnenschein, & Lenke, 1983). In this test, students are provided 
with a set of short, simple “lessons” in algebra and are immediately tested on what 
they have learned. The test thus consists of work samples in which the students’ 
later course learning is predicted from their performance in the sample learning 
tasks. A more unusual, and experimental, example of prognostic testing is pro- 
vided by the artifical language tests devised by the U.S. Office of Personnel Man- 
agement and by the Department of Defense in order to predict ability to learn a 
new language (Diane, Brogan, & McCauley, 1991). 

Another method of assessment, although with a completely individualized ori- 
entation, has received increasing attention since the 1980s. Essentially, this ap- 

proach follows a test-teach-test procedure, generally described as dynamic or 
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guided assessment, and is associated with remedial instruction. The individual's 
learning potential is evaluated by observing how well he or she can learn in a one- 
to-one relation with a professional who functions in the triple role of examiner, in- 
structor, and clinician. A major exponent of this method is Feuerstein (1979); 
several related approaches are discussed by A. L. Brown, Campione, Webber, and 
McGilly (1992) and by Lidz (1987, 1997). Because of its distinctly clinical nature, 
dynamic assessment is considered more fully in a later section of this chapter. 

Curriculum-based. measurement represents yet another set of techniques de- 
vised in order to link assessment and intervention (Deno, 1992; L. S. Fuchs, 
1993; Fuchs & Deno, 1991; Shinn, 1989). Although some curriculum-based as- 
sessment approaches can be quite informal, others involve standardized measures 
of student performance in basic skills, such as reading, spelling, and arithmetical 
computation; for a comparison of models, see Shinn, Rosenfield, and Knutson 
(1989). The common denominator of these techniques is a strongly behavioral 
orientation and a direct relationship to the tasks that make up the curriculum 
common to elementary education, as opposed to the inferential and norm- 
referenced approach of traditional psychometric instruments. Curriculum-based 
assessment has been used primarily in special education settings. 


Assessment in Early Childhood Education. Many new instruments for 
measuring the educational development of young children have been published 
in the past three’ decades. Several influences contributed to the amount and 
nature of this activity (see chaps. 9 and 12). Research in early cognitive 
development, the burgeoning of programs for preschool education, and the 
widespread concern about the effects of cultural handicaps on the child’s ability 
to profit from school instruction have all played major roles.!’ Some of the tests 
were designed principally to measure the outcomes of early childhood education 
and thus function as achievement tests. Others are presented as predictive 
instruments to assess the child's readiness for first-grade instruction. The two 
types of instruments merge imperceptibly, however, and each can usually serve 
either purpose. 

School readiness refers essentially to the attainment of prerequisite skills, knowl- 
edge, attitudes, motivations, and other appropriate behavioral traits that enable 
the learner to profit maximally from school instruction. These prerequisites are 
what J. McV. Hunt and Kirk (1974) called the “entry skills” that the child needs 
to cope with the teaching-learning situation encountered in the first grade. 
Readiness entails certain minimum levels of physical and sensorimotor develop- 
ment, achieved through maturation, as well as prior learning. Increasingly, em- 
phasis is being placed on the hierarchical development of knowledge and skills, 
whereby the acquisition of simple concepts equips the child for the learning of 


more complex concepts at any age. 


lation of a national educational goal stating that, by the year 


"Th f minated in the articul 
Wo hool ready to learn (National Council on Education Stan- 


2000, all children in the United States should start scl 
dards and Testing, 1992). 
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Readiness tests are generally administered upon school entrance. While they 
have much in common with intelligence tests for the primary grades, readiness 
tests place more emphasis on the abilities found to be important in learning to 
read. Some attention is also given to the prerequisites of numerical thinking and 
to the sensorimotor control required in learning to write. Among the specific 
functions frequently covered are visual and auditory discrimination, motor con- 
trol, aural comprehension, vocabulary, quantitative concepts, and general infor- 
mation. À widely used readiness battery is the Metropolitan Readiness Tests, 
Sixth Edition (MRT6—for reviews of an earlier edition of the MRT, see Mabry, 
1995 and Stoner, 1995). A different approach to school readiness is illustrated by 
tests that focus on the child's understanding of common relational concepts, such 
as the Boehm Test of Basic Concepts—Revised and the Bracken Basic Concept 
Scale (for reviews, see Fitzmaurice & Witt, 1989; Linn, 1989; Turco, 1989; and 
Ysseldyke, 1989). A discussion of the concept of school readiness, along with is- 
sues concerning its assessment, can be found in Gredler (1992). 


Concluding Comments. The field of education is in a state of flux, and 
educational testing reflects it, both in the United States and around the world. 
Change is likely to persist and intensify for the foreseable future.!® Many 
observers agree that there is a need to integrate assessment and instruction further 
and in such a way that these aspects of the educational enterprise complement 
each other better, to the advantage of the learner!? (H. Gardner, 1992; Nitko, 
1989). Most also recognize that both testing and teaching always will be subject 
to amelioration through theoretical and empirical advances. No single assessment 
or instructional tool will ever suffice; each has its limitations. Moreover, different 
students require different approaches. Thus, the search for improvements must, 
and will, go on. 


OCCUPATIONAL TESTING 


Psychological tests are commonly employed as aids in occupational decisions, in- 
cluding both individual counseling and institutional decisions concerning the se- 
lection and classification of personnel. In this section we discuss applications of 
testing that pertain to the assessment of the individual's occupational qualifica- 
tions as viewed by the institutions responsible for the selection, assignment, and 
evaluation of personnel. 


'8See, for example, E. L. Baker and O'Neil's (1994) volume assessing technological innovations in education 
and training, R. E. Snow and Lohman's (1989) discussion of the implications of cognitive psychology for educa- 
tional measurement, and Oakland and Hambleton's (1995) work on current international developments in aca- 
demic assessment. 


In fact, new computer-based systems incorporating instructional, assessment, and management features that 
can be individually tailored are rapidly appearing. The Learning Plus program developed by ETS for adult learn- 
ers who need to improve their basic academic skills and IBM's School Vista software for K-12 are two state-of- 
the-art examples of this kind of tools. 


Major Contexts of Current Test Use 505 


Organizations in the business and industrial sectors, in the federal, state, and 
local levels of government, and in the different branches of the armed services use 
almost every type of available test in personnel decision making. Multiple apti- 
tude batteries and special aptitude tests have often been developed for occupa- 
tional purposes, as have the situational tests described in chapter 16. Personality 
inventories (chap. 13) and biodata (chap. 16) are being used increasingly as well. 
A brief overview of the use of tests and other instruments in the selection and 
classification of personnel is provided by Landy et al. (1994); comprehensive 
treatments of the topic can be found in three recent books edited by Rumsey, 
Walker, and Harris (1994), Schmitt, Borman, et al. (1993), and Schuler, Farr, and 
Smith (1993). Major aspects of industrial/organizational use of tests are examined 
intensively in several chapters of the handbook edited by Dunnette and Hough 
(1990-1992).7? The Society for Industrial and Organizational Psychology (SIOP, 
1987) has prepared and adopted a set of principles for the validation and use of 
personnel selection procedures. While concerned with good practice in the 
choice, development, and evaluation of all personnel selection procedures, these 
principles are highly relevant to standardized tests. In addition, both the current 
Testing Standards (AERA, APA, NCME, 1985) and their proposed revision (see 
chap. 1) have a chapter on the topic of testing in employment settings. Another 
major application of occupational testing—also covered in both versions of the 
Testing Standards —is in the licensing and certification of persons deemed qualified 
to practice in any of a large number of trades and professions.”! 

As in our treatment of educational tests, in this section we shall concentrate 
on those tests specially designed for occupational purposes, over and above the 
more widely applicable instruments discussed in other chapters. We shall also ex- 
amine briefly some of the procedures and issues involved in assembling and vali- 
dating tests in employment settings. 


Validation of Employment Tests 


From the standpoint of both employee and employer, it is obviously of prime impor- 
tance that individuals be placed in jobs for which they have appropriate qualifica- 
tions. Effective placement also implies that traits irrelevant to the requirements of 
the particular job should not affect selection decisions, either favorably or unfavor- 
ably. If a mechanical ability test requires a much higher level of reading comprehen- 
sion than does the job, its use would not lead to the most effective utilization of 
personnel for that job. The simple psychometric fact that test validity must be as- 
certained for particular uses of the test has long been familiar. It has acquired new 


20The latest volume of the handbook—edited by Triandis, Dunnette, and Hough (1994)— deals with industrial/ 
organizational psychology issues in various cultures from around the world. 

?' Testing in the professions, including testing for certification and licensure of psychologists, is discussed by 
Anastasi (1988b, pp. 468-474). For a more recent examination of issues pertaining to the validation and use of 
licensure and certification exams in general, see the special issue of Evaluation & the Health Professions edited by 


LaDuca (1994) on that subject. 
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urgency because of the widespread concern about the job placement of culturally 
and educationally disadvantaged minorities (see chap. 18). An invalid test or one 
that includes elements not related to the job under consideration may unfairly ex- 
clude minority group members who could perform the job satisfactorily. 

Another relevant concern, both for organizations and for society at large, 
stems from the demonstrated relationship between job productivity and the valid- 
ity of selection instruments. Procedures for assessing this relation and typical re- 
sults were cited in chapter 6. The estimated gains and losses in productivity 
associated with rises and drops in the validity of personnel selection procedures 
are substantial. In organizations hiring many employees, such as government 
agencies, the cumulative value of such gains and losses is so large that it deserves 
close attention. 

For several decades, the prevalent opinion in personnel psychology was that se- 
lection tests should undergo full-scale validation against local criteria of job per- 
formance. Specific procedures for such criterion-prediction validation were 
discussed in chapters 5 and 6. However, a full-scale longitudinal validation study 
is unrealistic in the large majority of situations. Even under unusually favorable 
conditions, with access to large employee samples, several practical limitations 
become apparent (see, e.g., Anastasi, 1972; J. T. Campbell, Crooks, Mahoney, & 
Rock, 1973). In view of the practical difficulties in conducting full-scale local 
Sege validation, a number of alternative procedures have been ex- 
plored. 


Global Procedures for the Assessment of Performance. One approach 
to personnel selection uses assessment procedures that resemble the total job 
situation as closely as possible. This resemblance, however, can never be 
complete. A probationary appointment comes closest to being a true replica of the 
job. But even in this case, the shortness of the period and the knowledge that the 
appointment is probationary may influence worker behavior in a number of ways. 
Job samples represent another attempt to approximate actual job performance. In 
the job sample, the task is actually a part of the work to be performed on the job, 
but the task and working conditions are uniform for all applicants. Some job- 
sample tests are custom-made to fit specific jobs. The representativeness of the 
work sample and the closeness with which the task duplicates job conditions are 
essential considerations. Familiar examples are driving tests, as well as 
standardized tests for office skills, such as typing and operating various business 
machines. 

Some tests employ simulation to reproduce the functions performed on the job. 
Simulations merge imperceptibly with job samples. Examples range from the op- 
eration of a miniature punch press to simulators for locomotive engineers and for 
airplane pilots. Simulators have been used for both training and testing purposes 
in the United States space program administered by NASA, as well as in a num- 
ber of military specialties. 

To this list may be added assessment center techniques (see chap. 16), which 
have been used largely in evaluating managerial or administrative personnel 
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(Bray, 1982; Finkle, 1983; Moses, 1985; Thornton & Byham, 1982). A distinctive 
feature of this approach is the inclusion of situational tests, such as the in-basket, a 
technique adapted for testing executives in many contexts (N. Frederiksen, 1962, 
1966; Shapira & Dunbar, 1980). Simulating the familiar “in-basket” found on the 
administrator's desk, this test provides a carefully prepared set of incoming letters, 
memoranda, reports, papers to be signed, and similar items. Before taking the test, 
the examinee has an opportunity to study background materials for orientation 
and information regarding the hypothetical job. During the test proper, the task is 
to handle all the matters in the in-basket as the examinee would on the job. All 
actions must be recorded in writing but may include letters, memos, decisions, 
plans, directives, information to be obtained or transmitted, agenda for meetings, 
or any other notes. Other assessment center techniques may employ roleplaying, 
group problem-solving, and business games. A common feature is the use of mul- 
tiple assessors and peer ratings. Many of the traits evaluated pertain to motiva- 
tion, interpersonal skills, and other personality variables. 

Although depending at least in part on job resemblance as evidence of their 
“job relatedness,” these global performance assessment procedures have also been 
evaluated, both singly and in various combinations, against a variety of criteria 
(see Landy et al., 1994; Schmidt, Ones, & Hunter, 1992). 


Job Analysis and the Job Element Method. There is a growing interest in 
the application of content validation to personnel selection tests. In all its forms, 
this validation depends on a thorough and systematic job analysis (McCormick, 
1979). To be effective, a job analysis must identify the requirements that 
differentiate a particular job from other jobs. A description in terms of vague 
generalities that would be equally applicable to most jobs is of little use for this 
purpose. To obtain a well rounded picture of job activities, the job analyst may 
draw upon several sources of information. Published training and operating 
manuals, performance records, and, especially, subject-matter experts—such as 
supervisors, instructors, Or experienced workers in a field—are frequently 
consulted. 

An effective job analysis also should concentrate on those aspects of perfor- 
mance that differentiate most sharply between the better and the poorer workers. 
In his classic book on Aptitude Testing, Hull (1928) stressed the importance of 
these differentiating aspects of job performance. Later, this concept was reempha- 
sized by J. C. Flanagan (1949, 1954), who proposed the critical incident technique. 
Essentially, this technique calls for factual descriptions of specific instances of job 
behavior that are characteristic of either satisfactory or unsatisfactory workers. 

The focus on critical job requirements led to the development of the job ele- 
ment method for constructing tests and demonstrating their content validity 
(McCormick, 1979, 1983; McCormick, Jeanneret, & Mecham, 1972; Primoff, 
1975; Primoff & Eyde, 1988). This method was fully developed and widely em- 
ployed by Primoff and his associates at the U.S. Office of Personnel Management 
(formerly the U.S. Civil Service Commission). Essentially, job elements are the 
units describing critical work requirements. Although various adaptations of the 
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job element method differ in the details of procedure, all provide for the descrip- 
tion of job activities in terms of specific behavioral requirements, from which test 
items can be directly formulated. The specific behavioral statements can, in turn, 
be grouped under broader categories, or constructs—such as computational accu- 
racy, dexterity of hands and arms, visual discrimination, or ability to work under 
pressure. There is a growing body of research aimed at the development of a gen- 
eral taxonomy of job performance in terms of broad behavioral constructs (Fleish- 
man, 1975; Fleishman & Quaintance, 1984; Fleishman & Reilly, 1992b). 

Job analysis methods can contribute to facilitating the effective use of tests 
across many superficially different jobs. This is illustrated by instruments such as 
the Fleishman Job Analysis Survey (F-JAS) and the Work Keys system. The 
F-JAS is a job analysis tool designed to describe jobs in terms of the knowledge, 
skills, and abilities required to perform them. Fifty-two out of its 72 scales cover 
carefully defined abilities in the cognitive, psychomotor, physical, and sensory/ 
perceptual domains, and most of these have been linked to existing tests (Fleish- 
man & Mumford, 1991; Fleishman & Reilly, 1992a, 1992b). The remaining 20 
scales deal with the Interpersonal/Social and Knowledge/Skills domains and are 
still under development. On the other hand, the Work Keys system—recently 
devised by the American College Testing Program (ACT, 1995; Scruggs, 1994)— 
concentrates on a much smaller set of generic workplace skills, such as "Locating 
Information," that can be taught in a reasonable period of time. Within this frame- 
work, however, the system provides a coordinated package of tools for: (1) job 
analysis and profiling; (2) assessment of skill levels; (3) feedback to individuals, 
educators, and employers; and (4) instructional support in the implementation of 
training or educational curricula. 

Job analysis is one of the oldest and most viable methods developed in indus- 
trial psychology. Its application in validating employment tests has kept expand- 
ing as advances in computer technology have made the collection and analysis of 
job data more economical.”* In addition, information gathered through careful 
job analyses can be put to many other uses, such as establishing the value of a job 
in the marketplace or designing jobs (see, e.g., Campion, 1994; Goldstein, 
Zedeck, & Schneider, 1993). 


The Prediction of Job Performance. Practical difficulties inherent in the 
conduct of local criterion-prediction validation (see chaps. 5 and 6) have led to a 
relative dearth of such studies. Still, the fact remains that many organizations 
require forecasts of future job performance in order to make personnel selection 
and placement decisions.? For these purposes, the techniques of synthetic 
validation and validity generalization increasingly have become the alternatives 


7 An overview of advances and problems in job analysis methodology can be found in Harvey (1991). Knapp, 
Russell, and Campbell (1993) describe specific applications of job analysis in the context of selection and classi- 
fication of armed services personnel. 

23For a provocative discussion about the limitations of prediction as a paradigm in personnel selection, see 
de Wolff (1993). 
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of choice. Both allow estimates of the validity of a test for a particular job to be 
made in the absence of local validation. And, as empirical data accumulate, both 
methods should provide convergent evidence about the nature of the constructs 
assessed (J. P. Campbell, 19902). 

The concept of synthetic validation is based on the job element method premise 
that it is possible to identify skills, knowledge, and other performance require- 
ments common to many different jobs. Synthetic validity has been defined as “the 
inferring of validity in a specific situation from a systematic analysis of job ele- 
ments, a. determination of test validity for these elements, and a combination of 
elemental validities into a whole" (Balma, 1959, p. 395). Essentially, this tech- 
nique involves three steps: (1) detailed job analysis to identify the job elements 
and their relative weights in a particular job; (2) analysis and empirical study of 
each test to determine the extent to which it measures proficiency in performing 
each of these job elements; and (3) finding the validity of each test for the given 
job synthetically from the weights of these elements in the job and in the test. A 
statistical procedure for computing this validity was developed by Primoff (1959; 
Primoff & Eyde, 1988). Designated as the J-coefficient (for job coefficient), this 
procedure is essentially an adaptation of multiple regression equations, discussed 
in chapter 6. Other approaches to synthetic validation have been described by 
J. P. Hollenbeck and Whitemer (1988) and by Mossholder and Arvey (1984). 

Validity. generalization procedures—originally developed by Schmidt and 
Hunter (1977) and described in chapter 5— provide another avenue for validat- 
ing personnel selection tests. Basically, this approach permits the application of 
prior validity findings to a new situation through meta-analytic techniques 
(Schmidt, Hunter, Pearlman, & Hirsh, 1985). The actual extent to which find- 
ings obtained through meta-analyses are generalizable has been called into ques- 
tion by some. Critics have underscored the differences that exist across job 
situations as well as methodological problems in estimating parameters. This, in 
turn, has led to refinements in meta-analytic methods and, subsequently, to their 
increased acceptance and application. Although some controversial points re- 
main, and there is still room for further improvements, it is undeniable that valid- 
ity generalization methods have contributed greatly to the vitality of theory, 
research, and practice in occupational testing (see, e.g., L. R. James et al., 1992; 
Landy et al., 1994; Schmidt et al., 1993). 


The Criterion of Job Performance, Some of the most promising work in the 
field of personnel selection and classification stems from the renewed attention 
that is being given to criteria. It may be recalled, from chapter 5, that there is a 
vast array of indices that can be viewed as criterion measures, depending on how 
the criterion is defined. Nevertheless, within each validation study, a single 
convenient measure of job performance typically has been used to represent “the” 
criterion, regardless of the purpose of the prediction process. Until recently, 
in spite of repeated calls for more careful consideration of criteria, extending 
over several decades (see, e.g., L. R. James, 1973; Tenopyr, 1986; Wallace, 1965), 


little was done about this crucial problem. However, in the past few years, 
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several investigators have been working toward a clearer conceptualization of job 
performance and a better understanding of its determinants (Borman, 1991; 
Campbell, McCloy, Oppler, & Sager, 1993; B. E Green & Wigdor, 1991; Schmidt 
& Hunter, 1992). 

One new model of job performance which promises to play an important 
heuristic role is the multiple factor theory that John P. Campbell and his cowork- 
ers have been developing in conjunction with the U.S. Army Selection and Clas- 
sification Project (Project A—J. P. Campbell 1990a, 1990b, 1994; Campbell, 
McHenry, & Wise, 1990). Campbell's model takes into account the multifaceted 
nature of job performance and separates the various elements subsumed under 
that rubric. Initially, the model makes some fundamental distinctions between 
those aspects of work evaluation that are under the control of the worker—for ex- 
ample, the behaviors involved in job performance itself—and those that are 
not—for example, the consequences of job performance (effectiveness), its rela- 
tive costs (productivity), and the value placed on each of these aspects by the or- 
ganization (utility). With regard to job performance itself, the theory postulates 
that any job entails multiple performance components (tasks) and that the deter- 
minants of each component consist of various combinations of knowledge, skill, 
and motivational elements within the worker. Furthermore, each determinant of 
job performance has certain more or less specifiable antecedents—such as train- 
ing, reinforcement contingencies, and individual traits—that can influence the 
performance indirectly through their effects on the level of knowledge, skills, and 
motivation of the individual. In addition, these determinants of job performance 
interact with one another, with a consequent impact on performance. 

Although the multiple factor theory of job performance is still evolving (J. P. 
Campbell, 19902, 1994; D. J. Knapp & Campbell, 1993), its design is quite com- 
patible with other major conceptual and methodological advances in the evalua- 
tion of job behavior (Borman, 1991). The model presently identifies eight general 
factors of job performance, including such characteristics as consistency of effort, 
personal discipline, leadership, job-specific task proficiency, and other kinds of 
task proficiencies. These factors are presumed to be broad enough to encompass 
the main elements needed to describe all the jobs listed in the Dictionary of Occu- 
pational Titles. In addition, the model specifies three kinds of determinants of in- 
dividual differences in job performance—namely, declarative knowledge, proce- 
dural knowledge and skill, and motivation—as well as their antecedents. This 
comprehensive and clearly articulated theoretical structure should prove to be ap- 
plicable in a wide range of research on job performance constructs. 


Occupational Use of Tests 


As the preceding discussion suggests, the appropriateness of test use in personnel 
decisions cannot be considered apart from the specific purposes, situations, and 
populations involved in a given context.”* It also should be noted that—whereas 


24For a brief look at the multiplicity of variables that impinge on the success of the person-context interaction, 
see Sternberg (19942). 
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we may categorize tests by types for the sake of discussion—in practice, the de- 
marcation lines between knowledge, abilities, skills, and personality traits are not 
always clear. Thus, it may be more fruitful to think about work behavior as deter- 
mined by response capabilities, as some have suggested (Lubinski & Dawis, 1992). 
Moreover, although the validity of a test is often reviewed in isolation, tests are 
hardly ever used by themselves. Most test-based personnel decisions employ a 
combination of one or more measures as well as some other kind of assessment 
tool—such as an interview or background data.?> Keeping all of these caveats in 
mind, we now turn to consider the use of tests in occupational settings. 


The Role of Academic Intelligence. “Intelligence” is a broad term, with 
many definitions. What constitutes intelligence undoubtedly varies in different 
cultures, during different historical epochs, and at different life stages (see chaps. 
11 and 12). Traditional intelligence tests, in contrast, cover a more narrowly 
limited, identifiable cluster of cognitive skills and knowledge which, nevertheless, 
has proved to be widely predictive of performance in both academic and 
occupational activities demanded in modern technological societies. Because 
it deals largely with knowledge and skills developed in the course of formal 
schooling in such societies, this ability cluster is frequently described as academic 
intelligence or scholastic ability. Its content includes principally verbal com- 
prehension, quantitative reasoning, and other aspects of abstract thinking. 

It is well known that performance on tests of academic intelligence correlates 
substantially with amount of education. It would thus seem that educational re- 
quirements could be established to cover the applicant's qualifications in this im- 
portant cluster of cognitive skills and knowledge. There are difficulties, however, 
in the way of this solution. Amount of education is an indirect index of the indi- 
vidual’s cognitive developmental status, and the correlation between the two is 
far from perfect. Mere exposure to formal schooling does not ensure equal learn- 
ing of what has been taught; moreover, the knowledge and skills normally devel- 
oped through schooling can be acquired in other ways. It is therefore fairer to the 
individual to test her or his knowledge and cognitive skills rather than to accept 
or reject applicants on the basis of amount of formal education. 

Among commercially published instruments, several short tests of academic 
intelligence have been specially developed for use in industry. An example is 
the Wonderlic Personnel Test (Wonderlic Personnel Test, Inc., 1992). Begun as 
a revision of an early group intelligence test (the Otis Self-Administering Tests 
of Mental Ability), the Wonderlic is a 50-item, 12-minute test. It includes a 
variety of item types with verbal, numerical, and some spatial content—pre- 
sented in a spiral-omnibus format—and yields a single score. Available in many 
forms, this test has accumulated extensive norms on various occupational 
groups over its several decades of use, and its predictive value for training and 
job success is well documented (for reviews, see Belcher, 1992; Schmidt, 1985; 


Schoenfeldt, 1985). 


25See Guion (1991) for an excellent overview of the process of personnel assessment, selection, and placement, 


including findings concerning the validity and fairness of various tests and other types of predictors. 
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Interest in the potential usefulness of general academic intelligence tests for 
personnel selection has been reawakened by research on validity generalization 
(see, e.g., Hunter, 1986). Of particular relevance are the findings that tests of ver- 
bal and numerical reasoning have some predictive validity for a wide variety and 
range of jobs. In addition, the validity rises for jobs with more extensive demands 
for decision making and information processing. However, although tests of gen- 
eral cognitive ability contribute substantially to the prediction of job perfor- 
mance, especially for complex jobs, predictive accuracy may be enhanced by the 
assessment of additional variables. Chief among these are the more specialized 
skills and knowledge required by particular jobs—including psychomotor skills 
and tacit or procedural knowledge—as well as noncognitive variables, such as 
temperamental and attitudinal characteristics (see, e.g., Ackerman, 1992; J. P. 
Campbell, 1990b; Carroll, 1992; Kanfer, Ackerman, Murtha, & Goff, 1995; 
Sternberg, Wagner, Williams, & Horvath, 1995). Many of these relationships 
have been demonstrated in large-scale research. projects with the classification 
batteries employed in the armed services and in certain civilian government 
agencies. 


Aptitude Batteries for Special Programs.?? The General Aptitude Test 
Battery (GATB) was developed by the United States Employment Service 
(USES) for use by employment counselors in state employment service offices 
(U.S. Department of Labor, 1970). In addition, the battery may be obtained by 
nonprofit organizations such as secondary schools, colleges, and prisons. At 
present, the GATB includes 12 tests; 4 of the tests require simple apparatus, while 
the other 8 are paper-and-pencil tests. The entire battery can be administered in 
approximately 2/4 hours. It yields scores on nine factors and on three major 
composite measures derived from the factors, all of which are listed in Table 17-1. 

GATB score utilization can proceed through two distinct approaches. The first 
makes use of multiple-cutoff scores on the most significant aptitudes required for 
relatively homogeneous groups of jobs. One mechanism used with this approach 
is the Occupational Aptitude Pattern (OAP) structure developed in the 1970s 


General Aptitude Test Battery (GATB) Factors and Composites 


Factors — : or Tcu 
fem Gieneral Learning Ability S. Spatial Aptitude K. Motor Coordination 
V. Verbal Aptitude $ P. Form Perception E Finger Dexterity 
N. Numerical Aptitude : E Clerical Perception ` M. Manual Dexterity 
Composites į e rae E PEN EE SEH 
Cognitive =G+V+N Perceptual =S+P+Q Psychomotor =K + F+M 


26See chapters 10 and 11 for background pertinent to this section. 
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(U.S. Department of Labor, 1979, 1980). OAPs for more than 60 job families, 
covering thousands of occupations, have been prepared. For each job group, cut- 
off scores denoting high, medium, and low levels of qualification on the pertinent 
aptitudes have been calculated and can be employed to counsel individuals about 
suitable careers./ A second approach to the use of the GATB was derived 
through the application of validity generalization (VG) techniques to data from 
more than 500 prior USES validity studies. This procedure, which came to be 
known as the VG-GATB, uses validity estimates based on appropriate combina- 
tions of scores for all jobs within each job family (U.S. Department of Labor, 
1983a, 1983c, 1983d). Predictions are based on three composite measures— 
cognitive, perceptual, and psychomotor—derived from the original factor scores. 
Of the three, the cognitive composite yields the highest validity coefficients for 
most jobs, but the psychomotor composite can enhance prediction as jobs de- 
crease in complexity (Hunter & Hunter, 1984). 

Through the facilities of the USES, an extensive body of data has been accu- 
mulated on the GATB, much of which attests to its exceptional reliability and 
substantial predictive validity (for reviews, see B. Bolton, 1994; Kirnan & 
Geisinger, 1986). However, the practice of subgroup norming—implemented in 
the 1980s with the VG-GATB in order to ensure comparable rates of referrals for 
White, Black, and Hispanic applicants—led to a heated political debate about 
fairness in preemployment testing (Hartigan & Wigdor, 1989). The debate culmi- 
nated with the passage of the Civil Rights Act of 1991 (BL. 102-166), which pro- 
hibited subgroup norming. This legislation has affected the use of the battery and 
has made its future uncertain (L. S. Gottfredson, 1994; Wigdor & Sackett, 
1993.—also see chap. 18). Nevertheless, a research program that includes the de- 
velopment of two new forms and an experimental computerized adaptive version 
of the GATB is still in progress. 

Another major selection and classification tool is the Armed Services Voca- 
tional Aptitude Battery (ASVAB), developed jointly for use in all the U.S. armed 
services (Bayroff & Fuchs, 1970). The battery is administered to high school stu- 
dents interested in military occupations and to individuals who have applied to 
enter the military. Current forms of the ASVAB include ten subtests listed in 
Table 17-2.28 The Armed Forces Qualification Test (AFQT) is a composite score 
used by all the services as a gauge of general trainability in screening potential re- 
cruits. In addition, each of the services combines subtests to form composites for 
its own personnel selection and classification needs. For example, the Army's 
Combat composite consists of the AR + CS + AS + MC component subtests. 


27The Special Aptitude Test Battery (SATB) groups provide an alternative mechanism for the use of cutoff 
scores with the GATB. The multiple-cutoff strategy is more fully discussed in chapter 6. In the realm of selec- 
tion, as opposed to counseling, the most appropriate use of cutoff scores is for a preliminary screening of appli- 


cants in one or more critical skills. 

28There is also a Computerized Adaptive Testing version of the ASVAB (CAT-ASVAB), which has been under 
development since 1979 and now is being used operationally for some military entrance processing (T. L. Russell, 
Reynolds, & Campbell, 1994). For a description of the CAT-ASVAB and its development, see Wiskoff & 


Schratz (1989). 
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Armed Services Vocational Aptitude Battery (ASVAB) 


ASVAB Subtests 


General Science (GS) |... _ Arithmetic Reasoning (AR) 
Word Knowledge (WK)!  — -Mathematics Knowledge (MK)* 
Paragraph Comprehension PENE Sr ` Mechanical Comprehension (MC) 
Electronics Information (El). ` Auto and Shop Information (AS) 
Co ding Speed (CS) a ES An ES jg Num erical Operations (NO)? 


L 


*AFQT composite bSpeeded 


Standard scores for the ASVAB are based on norms obtained from a representa- 
tive sample of American youths (U.S. Department of Defense, 1982). The factor 
structure of the battery has been examined at some length. Findings typically 
show a general factor that accounts for about 60 percent of the total ASVAB vari- 
ance and four group factors that have been repeatedly replicated (J. R. Welsh, 
Watson, & Ree, 1990). The four factors, along with the subtests on which they 
have the highest loadings, are: (1) Verbal (WK and PC); (2) Speed (NO and 
CS); (3) Quantitative (AR and MK ); and (4) Technical (AS, MC, and EI). 

The validities of the ASVAB subtests and composites have been investigated 
against a variety of job and educational performance criteria. As might be imag- 
ined, validity coefficients differ substantially depending on the type and number 
of criteria employed. In general, validities are higher for “can-do” criteria—such 
as general soldiering and technical proficiency—than for “will-do” criteria—such 
as effort, leadership, and personal discipline. As might be expected, the former are 
assessed through job knowledge and “hands-on” measures, whereas the latter typ- 
ically are assessed through supervisor, peer, and self ratings (McHenry, Hough, 
Toquam, Hanson, & Ashworth, 1990; T. L. Russell et al., 1994). 

The most extensive examination of the ASVAB was initiated by the U.S. De- 
partment of Defense in 1980, as a part of the Joint-Service Job Performance Mea- 
surement/Enlistment Standards (JPM) Project (Wigdor & Green, 1991a, 1991b). 
This massive project is aimed at developing robust measures of performance for 
entry level military jobs so that meaningful and valid enlistment standards can be 
set for the all-volunteer armed forces. The first phase of the JPM established that 
the ASVAB is a good predictor of high-fidelity, performance-based indicators of 
job proficiency.” It also provided support for the differential validity of the 
ASVAB composite scores for different jobs. However, the magnitude of the aver- 
age score differences between Black and nonminority examinees was considerably 
greater on the AFQT and on paper-and-pencil tests of job knowledge than on the 
hands-on job sample tests. This suggests the possibility that some of the ASVAB 


2°The conceptual and methodological work that went into developing performance-based measures is a valuable 
contribution of the JPM project which should prove applicable in the field of educational assessment and 
elsewhere. 
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measures may overestimate the size of the actual group differences in job perfor- 
mance. If corroborated, this situation would parallel some of the findings obtained 
with the GATB (Hartigan & Wigdor, 1989). Furthermore, the validity coeffi- 
cients of the ASVAB against various criteria were modest enough to justify a 
search for additional predictors. The second phase of the JPM project is exploring 
models for enlistment standards that will enhance the overall utility of selection 
and classification decisions, both in terms of costs and levels of performance. 

The U.S. Army’s Selection and Classification Project (Project A) encompasses 
another important segment of the research on the ASVAB and on new predictors 
of military job performance. Project A originated as the Army’s own extensive re- 
sponse to the mandate of the JPM project and is arguably the largest and most ex- 
pensive personnel selection research project ever (Schmidt et al., 1992). In 
addition to its contribution to building a theory of job performance—discussed in 
an earlier section of this chapter—the work of Project A. has involved develop- 
ment and evaluation of many new predictors that go beyond the traditional cogni- 
tive functions of the ASVAB. The evolving battery includes computer-administered 
perceptual arid psychomotor tests; it also involves specially constructed invento- 
ries designed to assess personality, temperamental, and background variables as 
well as interests (McHenry et al., 1990; N.G. Peterson et al., 1990). Moreover, the 
large scale and longitudinal nature of Project A have also made it possible to study 
validity changes over time to an unprecedented extent (J. P. Campbell, 1990b).7° 


Special Aptitude Tests. Even prior to the development of multiple aptitude 
batteries, many people recognized that intelligence tests were limited in their 
coverage of abilities. Efforts were soon made to fill the major gaps by means of 
special aptitude tests covering the more concrete and practical abilities, such as 
mechanical’ aptitude. The demands of occupational selection and counseling 
likewise stimulated the development of measures of spatial, clerical, musical, and 
artistic aptitudes. Tests of vision, hearing, muscular performance, and motor 
dexterity have also been widely used in selecting and classifying personnel for 
industrial and milita es. ^ 

A word should gei ibl the concept of special aptitudes. The term origi- 
nated at a time when the major emphasis in testing was placed on general intelli- 
gence. Mechanical, musical, and other special aptitudes were thus regarded as 
supplementary to the “IQ” in the description of the individual. With the advent 
of factor analysis, however, it was gradually recognized that intelligence itself 
comprises a number of relatively independent aptitudes, such as verbal compre- 
hension, numerical reasoning, spati visualization, and the like. Moreover, sev- 
eral of the traditional special aptitudes, such as mechanical and clerical, are now 
incorporated in some of the multiple aptitude batteries. 


Eege 
304 final report on Project A research, in the form of a volume co-edited by J. P. Campbell and D. Knapp, is 
scheduled for publication in the late 1990s. ; 

31 An overview of personnel selection procedures specifically for jobs that are physically demanding is presented 


by J. C. Hogan (1992). 


516. APPLICATIONS OF TESTING 


What, then, is the role of special aptitude tests? First, there are certain. areas, 
such as vision, hearing, motor dexterity, and artistic talents, that are rarely in- 
cluded in multiple aptitude batteries. The situations requiring tests in these areas 
are too specialized to justify the inclusion of such tests in standard batteries. Spe- 
cial aptitude tests are also employed, however, in areas that are covered in multi- 
ple aptitude batteries, such as clerical and mechanical aptitudes. In several testing 
programs, tests of academic intelligence are combined with specially selected tests 
of other relevant aptitudes. One reason for this practice is to be found in the ex- 
tensive normative and validation data available for some widely used tests of spe- 
cial aptitudes. Another reason, undoubtedly, is the flexibility that this procedure 
provides, not only in the choice of relevant aptitudes but also in the fullness with. 
which each aptitude is measured for specific purposes. 

Many tests have been devised to measure speed, coordination, and other psy- 
chomotor skills. Most. are concerned with. manual dexterity, but a few involve leg 
or foot movements that may be required in performing specific jobs. Some mea- 
sure a combination of motor and perceptual, spatial, or mechanical aptitudes. 
The principal application of these tests has been in the selection of industrial and 
military personnel. Psychomotor tests are characteristically apparatus tests, al- 
though several paper-and-pencil adaptations have been designed for group ad- 
ministration. An. example of a published instrument requiring several. simple 
manipulative skills is the Crawford Small Parts Dexterity Test (Crawford & 
Crawford, 1981), shown in Figure 17-2. In Part I of this test, the examinee uses 
tweezers to insert pins into close-fitting holes and then places a small collar over 
each pin. In Part II, small screws are placed in threaded holes and screwed down 
with a screwdriver. The score is the time required to complete each part. 

-. What can be said about the effectiveness of psychomotor tests as a whole? The 
most important point to note in evaluating such tests is the high degree of speci- 
ficity of motor functions. Intercorrelations and factor analyses of large numbers of 
motor tests have failed to reveal broad group factors such as those found for intel- 


Figure 17-2. Crawford Small Parts Dexterity Test. 
(Courtesy of The Psychological Corporation.) 
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lectual functions (Fleishman, 1975; Fleishman & Quaintance, 1984, chap. 12). In 
considering the validity of psychomotor tests, we need to differentiate between 
complex motor tests that closely resemble the particular criterion performance 
they are trying to predict and tests of simple motor functions designed for more 
general use. The former are custom-made tests—now typically computerized— 
that reproduce the combination of motor aptitudes required by the criterion and 
have shown fair validity. Some Air Force tests, for example, have been shown to 
improve prediction of flying performance (see, e.g., RH. Cox, 1989; Kantor & 
Carretta, 1988). For most purposes, however, the use of such tests is not practica- 
ble, since a very large number of tests would have to be devised to match different 
criteria. With regard to commercially available motor tests, the functions they 
measure are very simple, and their validities against most criteria are not high. For 
this reason, such tests can serve best as part of a selection battery, rather than as 
single predictors. 

Mechanical aptitude tests cover a.variety of functions. Psychomotor factors enter 
into some of the tests in this-category, either because the rapid manipulation of 
materials is required in the performance of the test or because special subtests de- 
signed to measure motor dexterity are included in a paper-and-pencil test. Percep- 
tual and spatial- aptitudes also play an important part in many of these tests. 
Finally, mechanical reasoning and sheer mechanical information predominate in 
a number of mechanical aptitude tests. sa 2 i 

It is important to recognize the diversity of functions subsumed under the 
heading of mechanical aptitude, since each function may be differently related to 
other variables. Forexample; mechanical information tests are much more depen- 
dent on past experience with mechanical objects than are abstract spatial or per- 
ceptual tests. Similarly, gender differences may be reversed from one of these 
functions to another. Thus, in manual dexterity and in perceptual discrimination 
tests, women generally excel; in abstract spatial tests, a small but significant aver- 
age difference in favor of males is usually found; while in mechanical reasoning 
or information tests, men are markedly superior (Anastasi, 1981c; Hedges & 
Nowell, 1995). j ue i a 

Among the aptitudes included in all multiple aptitude batteries is spatial apti- 
tude. This is the ability measured by the Space Relations test of the DAT 
(see chap. 10); it has been found to have a high loading in many performance and 
nonlanguage tests of general intelligence. One of the best single measures of this 
aptitude is the: Revised Minnesota Paper Form Board Test (Likert & Quasha, 
1995). Another major type of mechanical aptitude test is concerned with mechan- 
ical information, mechanical reasoning, Or mechanical comprehension. Although 
they require some familiarity with common tools and mechanical relations, these 
tests assume no more technical knowledge than can be acquired through every- 
day experience in a modern industrialized society. Some of the early tests in b 
field required the examinee to assemble common mechanical objects from the 
given parts. For general testing purposes paperand-pencil group ip x^ now 
widely employed. A well-known example of this type of test ds the sinas 
Mechanical Comprehension Test (G. K. Bennett, 1994). Utilizing pictures about 
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x 


Which man carries more weight? 
(If equal, mark C.) 


Y 


Which letter shows the seat where 
a passenger will get the smooth- 
est ride? 


Figure 17-3. Sample Items from the Bennett Mechanical Comprehension Test. 
Answers are recorded on a separate answer sheet. 


(Reproduced by permission. Copyright © 1967-1970, 1994 by The Psychological Corporation. All rights re- 
served.) 


which short questions are to be answered—such as those illustrated in Figure 
17-3—this test emphasizes the understanding of mechanical principles as applied 
to a wide variety of everyday life situations. 

Tests designed to measure clerical aptitudes are characterized by a common em- 
phasis on perceptual speed and accuracy. A typical example is the Minnesota 
Clerical Test (MCT—Andrew, Paterson, & Longstaff, 1979), which consists of 
two separately timed subtests, Number Comparison and Name Comparison. In 
the first, the test taker is given 200 pairs of numbers, each containing from 3 to 12 
digits. If the two numbers in the pair are identical, a check mark is placed be- 
tween them. The task is similar in the second subtest, proper names being substi- 
tuted for numbers. 

Such relatively homogeneous tests as the MCT measure only one aspect of 
clerical work. Clerical jobs cover a multiplicity of functions. Moreover, the num- 
ber and particular combination of duties vary tremendously with the type and 
level of job. Despite such a diversity of activities, however, job analyses of clerical 
work generally indicate that a relatively large proportion of time is spent in tasks 
requiring speed and accuracy in perceiving details. To be sure, many types of jobs 
besides that of clerk require perceptual speed and accuracy. Inspectors, checkers, 
packers, and a host of other factory workers obviously need this ability, though 
many of these jobs are being replaced by electronic scanning devices. 


Major Contexts of Current Test Use 519 


Several tests of clerical aptitude combine perceptual speed and accuracy with 
other functions required for clerical work. Among the tools used are job-sample tests 
for such activities as alphabetizing, classifying, coding, and the like. In addition, 
some measure of verbal and numerical ability may be included, to serve in lieu of a 
general intelligence test. Other clerical aptitude tests include such office skills as 
business vocabulary, business information, spelling, and language usage. Some cleri- 
cal tests are more properly classified as achievement or job sample tests, since they 
measure skills acquired after the completion of specialized training. Examples in- 
clude stenographic and typing skills tests, as well as tests on data entry and data re- 
trieval skills like the CRT Skills Test by Science Research Associates (SRA, 1990). 

The rapid growth in the use of computers for office work led to the publication 
of several tests for computer-related aptitudes. Many of these were designed for the 
counseling or selection of potential trainees. Examples include the Computer Lit- 
eracy and Computer Science Tests and the Computer Programmer Aptitude Bat- 
tery (for reviews of these instruments, see Mahurin, 1992; Marco, 1992; Schafer, 
1992). Although these tests represented timely applications of psychometrics to 
personnel assessment at the time they were developed—between the 1960s and 
the early 1980s—the pace of technological change is so fast that some of them al- 
ready have become obsolete. Gradually, newer instruments are becoming avail- 
able, both for instruction and for competency testing in the use of specific 
software programs, such as dBASE, WordPerfect, and Lotus 1-2-3.? 

Recent strides in recognizing the significance of the cognitive dimensions of 
interpersonal and intrapersonal functioning (see, e.g., H. Gardner, 1983; Salovey 
& Mayer, 1990) are likely to stimulate the development of standardized objective 
instruments aimed at assessing the social and emotional aspects of intelligence in the 
context of work. Such instruments have been heretofore practically nonexistent, 
as the adequacy of interpersonal and intrapersonal functioning in employment 
settings has traditionally been gauged either through personality tests or through 
interviews and assessment center techniques. The Teamwork-KSA Test, a re- 
cently published paper-and-pencil measure designed to predict effective team- 
work, is one of the first instruments within this new category. Based on a review of 
the literature of work groups in organizations, the test attempts to measure the 
knowledge, skills, and abilities of examinees in the interpersonal and self- 
management domains: The multiple-choice items present hypothetical questions 
dealing with conflict resolution, communication, and collaborative problem solv- 
ing, as well as with. goal setting, planning, and other self-management skills 
(M. J. Stevens & Campion, 1994). Further work and experimentation with this 
and other similar instruments are sure to follow. 


Personality Testing in the Workplace 


In the mid-1980s, Bernardin and Bownas (1985) noted that the use of personality 
evaluation techniques—including such unscientific methods as handwriting 


uiid psc ROME MEN 
32These tests, and several other computer-related instruments, are available from the SRA Product Group of 


McGraw-Hill/London House (see Appendix B). 
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analysis—was flourishing in industrial settings, while the academic community 
had largely neglected the subject for almost two decades.” Since then, à tremen- 
dous resurgence of work in the area has occurred, prompted by methodological 
and theoretical developments. From a methodological viewpoint, the application 
of meta-analysis and the use of causal modeling techniques (see chap. 5) have 
stimulated research on the noncognitive traits that can affect job performance. 
Meta-analyses have been employed primarily to investigate the validity and util- 
ity of personality constructs across settings. Path analysis and structural equation 
modeling techniques (see chap. 5) are used to study the interrelationships among 
predictors and to explore the attributes and conditions leading to differing levels 
of job performance. The extent to which some critical variables, like ability level 
and job autonomy, moderate the relationship between personality and perfor- 
mance is also under study. In general, the goals of these investigations go beyond 
prediction, toward an understanding of the constructs and processes that deter- 
mine the large variations that exist in job performance (Barrick & Mount, 1991, 
1993; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Matthews, Jones, & 
Chamberlain, 1992; McHenry et al., 1990; Schmidt & Hunter, 1992; Tett, Jack- 
son, & Rothstein, 1991). 

The more methodologically sophisticated research of the past decade or two 
has provided considerable support for the use of carefully constructed personality 
tests in employment decisions across a variety of settings. An important issue that 
still bears further investigation is that of determining the most relevant personal- 
ity dimensions relative to performance in particular jobs, or job families. For ex- 
ample, whereas a high degree of sociability might be required for some jobs, the 
opposite may be the case for others. Even a characteristic like conscientiousness, 
which may appear at first glance to be desirable for all jobs, might upon further 
investigation prove to involve attributes that may be counterproductive for cer- 
tain endeavors—such as creative pursuits—at least in some respects. As a result 
of considerations such as this, some attention is now starting to be directed to- 
ward establishing the temperamental and interpersonal requirements of jobs (R. 
Hogan, Hogan, & Roberts, 1996; Landy et al., 1994). 

A great deal of recent research activity has centered on exploring the useful- 
ness of the Five-Factor Model of personality structure, in its various forms, to the 
prediction of job performance criteria (see chap. 13 and R. Hogan, 1991). The 
popularity of the Five-Factor Model in personnel selection research is not coinci- 
dental. After all, the five (+ 2) factors were distilled from vast arrays of personality 
description data and represent a model for describing normal personalities upon 
which there is a fair amount of agreement. Each of the domains subsumed by the 
model is widely applicable to all sorts of everyday behavior, including the perfor- 
mance of job duties. The dimension of Emotional Stability, for instance, is essen- 


33See Anastasi (1985e) for an historical overview of personality testing in industry and for a discussion of several 
basic methodological and practical problems connected with it. Kanfer et al. (1995) provide a review of progress 
in the application of concepts from personality and intelligence theories to the field of industrial and organiza- 
tional psychology. 
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tial in jobs that require quick decision-making under stressful conditions, such as 
law enforcement, airplane piloting, truck driving, and medical emergency work. 
Agreeableness, on the other hand, is an indispensable quality for any job that in- 
volves extensive interpersonal contact. Not surprisingly, however, the Conscien- 
tiousness factor is the one that has emerged from. several meta-analytic 
investigations as the most ubiquitious and substantial personality predictor of job 
performance (Barrick & Mount, 1991, 1993; Schmidt & Hunter, 1992). Even so, 
agreement is still far from universal either with regard to the definition of the 
construct of Conscientiousness Or to the question of its primacy (see, e.g; 


Loevinger, 1994; Tett et al. 1991). 


Integrity Tests. The application of integrity or honesty tests in hiring decisions 
came into special prominence after. the passage of the Employee Polygraph 
Protection Act of 1988 (P. L. 100-347), which prohibited the use of polygraphs in 
screening personnel, except in special settings such as government agencies. Asa 
result of this legislation, paper-and-pencil measures of trustworthiness—widely 
known as “integrity tests" proliferated. These measures, which typically inquire 
about an applicant's attitudes toward and history of involvement in theft and other 
illicit behaviors, quickly came under scrutiny both from within and outside 
psychology)? In one of the most comprehensive investigations of integrity tests, 
Ones, Viswesvaran, and Schmidt (1993) conducted a meta-analysis based on 665 
validity coefficients. Their study estimated that the average operational validity of 
integrity tests for predicting supervisory ratings of job performance is Al and 
indicated that. these tests can also be of value in the prediction of disruptive 
behavior on the job. 

In spite of a fairly substantial degree of support gathered for the use of integrity 
tests, within certain parameters, their application in preemployment selection de- 
cisions remains problematic. Current concerns center on the tests’ possible sus- 
ceptibility to coaching and faking and on their relative ineffectiveness in 
predicting specific forms of job misbehavior, such as theft (see, e.g. Alliger, 
Lilienfeld, & Mitchell, 1996; Camara & Schneider, 1995). 


Leadership... The. selection of. effective leaders poses another significant 
challenge in. the field of personnel. decisions. Leadership is one of the most 
sought-after, qualities in the workplace because it encompasses the ability to 
persuade others to work toward a common, cause. Effective leadership can have 
powerful consequences in enhancing organizational functioning; ineffective 


za i itical reviews of the topic of lie detection and polygraphy written by a psychologist is 
LEE dde rb Scand more recent reviews and discussions can be found in DePaulo (1994), Honts 
(1994), Kircher and Raskin (1992), Lykken (1992), and Saxe (1994). 

35See; for example, the report of a task force that the American Psychological Association formed to look into 
this Sep (SR Goldberg, Grenier, Guion, Sechrest, & Wing, 1991). Further information is available in. 
; rehensive volume on the general topic of honesty in the workplace. For brief but in- 


K. R. Murphy's (1993) comi l 
E, SNE, ee testing in personnel selection, see Camara. and Schneider (1994), Sackett 
(1994), and the Comment section of the June 1995 American Psychologist. 
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leadership can be disastrous. Thus, it is not surprising that many types of 
cognitive and personality measures, as well as assessment center techniques, 
interviews, simulations, and biodata questionnaires, are applied to the selection 
of top management. Often, the assessment process involved in selecting high- 
level executives includes multiple methods and extends over a period of several 
hours or days. Such procedures are very similar to clinical testing and are typically 
quite costly. 

For the most part, apparently, the resources devoted to the careful selection of 
individuals who go from the regular workforce into leadership positions are rela- 
tively meager compared to the potential impact of these decisions. Possibly as a 
consequence of this situation, managerial incompetence seems to be exceedingly 
high (R. Hogan, Curphy, & Hogan, 1994). Although interest in studying leader- 
ship behavior has a long history in applied psychology, and has resulted in an ex- 
tensive literature on the subject, much remains to be done in terms of identifying 
and clarifying the empirical correlates of effective leadership. For further informa- 
tion on leadership measures and on theoretical and methodological advances in 
the study of leadership in organizations, see Bass (1990), Clark and Clark (1990), 
and Yukl and Van Fleet (1992). 


Instruments. With the exception of the intensive assessment procedures used 
in the selection of top-level executives and some specialized personnel—such as 
intelligence agents, astronauts, and the like—the vast majority of job applicants 
who take personality tests simply are given a self-report inventory of the type 
described in chapter 13. Inventories aimed at the detection of psychopathology, 
such as the MMPI, are still employed for screening purposes in some sensitive 
occupations, but concerns over invasion of privacy and the misuse of results have 
led to the predominant use of measures designed to assess personality in normal 
subjects (see chap. 18). In fact, several of the older multidimensional self-report 
inventories that originally were meant for normal subjects—for example, the 
CPI— have been recently revised in ways that will make them more appropriate 
for use in the workplace. In the meantime, some newer measures, like the Hogan 
Personality Inventory (HPI—R. Hogan & Hogan, 1992), have been designed 
primarily for occupational settings in the first place. Other tests are being 
developed for a much narrower range of applications; examples include the PDI 
Customer Service Inventory and the PDI Employment Inventory, which is an 
instrument designed to screen applicants who will be dependable and stable 
hourly employees (see Paajanen, Hansen, & McLellan, 1993). 


Concluding Comments. Technological developments are currently trans- 
forming the nature of work in ways that will far exceed the changes brought about 
by the industrial revolution. The rapid pace of change poses a tremendous 
challenge for personnel psychologists, while the methodological advances made 
possible by new technologies offer immense opportunities. 5 Just when it has 


36For a discussion of the transformations that are occurring in the workplace and the consequences they are 
likely to have for personnel work, see Landy, Shankster-Cawley, and Kohler Moran (1995). 
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become feasible to do the type of large-scale and highly sophisticated research 
that allows investigators to arrive at some fairly definitive answers to questions 
that have been around for a long time, the questions themselves are beginning to 
change. Meanwhile, the impact of computer technologies on personnel testing, 
while visible, is still only starting (Bartram, 1993; Burke, 1993; Schoenfeldt & 
Mendoza, 1991). Some of the more exciting applications of technology to 
testing—such as interactive multimedia tests—are just in the preliminary stages 
of development (Desmarais, Masi, Olson, Barbera, & Dyer, 1994; Drasgow et al., 
1996). Many more can be foreseen but have not yet been realized. Altogether, 
though, there seems to be more ample room than ever for the thoughtful 
application of psychological principles and methods to improving the utilization 
of human resources (Bray etal., 1991; Gatewood & Feild, 1993). s 


TEST USE IN CLINICAL AND COUNSELING PSYCHOLOGY 


Clinical and counseling psychologists employ a wide variety of tests, including 
most of the types already discussed. Periodic surveys have been conducted regatd- 
ing the use of tests by samples of clinical and counseling psychologists who differ 
in their theoretical orientations and work settings (Archer, Maruish, Imhof, & 
Piotrowski, 1991; Lubin, Larsen, & Matarazzo, 1984; Piotrowski & Keller, 1992; 
Watkins et al., 1994; Watkins, Campbell, Nieberding, & Hallmark, 1995). High 
in frequency of use are such individual intelligence tests as the Wechsler scales, as 


The present section of this chapter focuses primarily on areas that require the 


prognostic, 


Hersen, Kazdin, & Bellack, 1991; Hurt, Reznikoff, & Clarkin, 1991; Maruish, 
1994), whereas counseling psychologists originally specialized exclusively in voca- 
tional assessment and guidance (S. D. Brown & Lent, 1992; Drummond, 1996; 
Gelso & Fretz, 1992). Other major fields in which psychological assessment 
is practiced, such as school psychology and forensic psychology," derive their 


J Forensic psychology is the practice of psychology in courts of law, in relation to legal matters, or with adjudi- 


cated or incarcerated individuals. 
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identification from the settings where professionals work (Shapiro, 1991; Vance, 
1993; Weiner & Hess, 1987). Additional specialties utilizing psychological assess- 
ment techniques are defined by the types of problems they study, such as health 
psychology (N. Adler & Matthews, 1994; Streiner & Norman, 1995; S. E. Taylor, 
1990) and neuropsychology, a field which is concerned with brain-behavior rela- 
tionships (Benton, 1994; Maruish & Moses, 1996). Still others are identified by 
the kinds of patients or clients they serve—for example, children, couples, or 
families (Conoley & Werth, 1995; Fruzzetti & Jacobson, 1992; Kamphaus & 
Frick, 1996). Many practitioners work in even more highly specialized areas, like 
pediatric neuropsychology (Batchelor & Dean, 1996) or medical rehabilitation 
(Cushman & Scherer, 1995). All of these fields and subspecialties are evolving 
constantly. In addition, the boundaries between some specialties have become 
less clear and, in some cases, almost disappeared. 

A notable instance of this overlap is the growing rapprochement between clin- 
ical and counseling psychology (see, e.g., Anastasi, 1979, 1990a; May, 1990).38 
Counseling psychologists increasingly can be found working in private practice, 
clinics, and hospital settings. At the same time, some clinical psychologists are 
now applying their skills in career assessment and personnel selection (Lowman, 
1989, 1991, 1993). In a visible sign of the changes that are taking place, the pro- 
posed new Testing Standards address the use of tests for diagnosis and individual 
evaluation in clinical, counseling, and school settings within a single chapter en- 
titled “Psychological Testing and Assessment,” whereas the 1985 Standards dealt 
with clinical testing and test use in counseling separately (see chap. 1). 


Psychological Assessment. What, then, does psychological assessment 
entail as it is practiced across specialties, settings, problem areas, and populations? 
One of its main characteristics is a focus on the intensive study of one or more 
individuals (e.g., a couple or a family) through multiple sources of data. By 
establishing and maintaining rapport, the clinician may elicit from the client 
pertinent facts about her or his life history not readily accessible in other ways. 
Such life-history data provide a particularly sound basis for understanding an 
individual and predicting subsequent behavior. Furthermore, the clinician 
contributes to the fact-finding process by serving as a stimulus in an interpersonal 
situation. In this regard, the clinical interview functions as a situational test or 
simulation, providing a sample of the client’s interpersonal behavior under more- 
or-less controlled conditions. 

Information derived from observation, interviewing, and the case history is 
combined with test scores to provide an integrated picture of the individual 
(Beutler & Berren, 1995). The clinician thus has available certain safeguards 


38Because of this circumstance, as well as for the sake of convenience, in this section the term “clinician” will be 
used to denote any professional involved in the practice of psychological assessment. Nevertheless, it is useful to 
bear in mind that the educational backgrounds of these professionals may be in any one of several specialties 
and, thus, their orientations and areas of expertise do vary. For information about test use by clinical, counseling, 
industrial/organizational, and school psychologists in their traditional settings, see P. S. Wise (1989). : 


Major Contexts of Current Test Use 525 


against overgeneralizing from test scores. This fact probably accounts at least in 
part for the continued use of some tests whose psychometric properties are either 
weak or unproved. As long as such instruments serve primarily to suggest leads for 
the skilled clinician to follow up, their retention can be justified. There is a dan- 
ger, of course, that a relatively inexperienced and overzealous clinician, unmind- 
ful of the limitations of an instrument, may place more confidence in scores than 
is justified. & 

Another defining attribute of psychological assessment is its aim, which gener- 
ally is to help in-making informed decisions pertaining to differential diagnosis, 
career selection, treatment recommendations, educational planning, child cus- 
tody determinations, culpability, and many more such matters of practical signifi- 
cance to one or more persons. Decision-making proceeds by means of the 
collection, analysis, integration, and thoughtful reporting of relevant behavioral 
data. At the center of the process of psychological assessment is a continuing se- 
quence of hypothesis generation and hypothesis testing about the individual case. 
Each item of information—whether it be an event recorded in the case history, a 
comment by the client, or a test score—suggests a hypothesis about the individ- 
ual that will be confirmed or refuted as other facts are gathered. In this respect as 
well, it should be kept in mind that any single source of data, no matter how 
seemingly reliable, may yield inaccurate information at times. 

Fundamentally, all the activities connected with a psychological assessment — 
from the clear elucidation of its primary purpose to the communication of results 
— involve professional judgment based on knowledge about the specific problems 
and populations at hand: Moreover, the application of individual tests and other 
assessment tools requires specialized skills in their use and a careful consideration 
of their characteristics, in light of the purpose and context of the assessment 
(Drummond, 1996; C; T. Fischer, 1985; Groth-Marnat, 1990; G. Goldstein & 
Hersen, 1990; Hood & Johnson, 1997; Maloney & Ward, 1976; Tallent, 1992; 
Walsh & Betz, 1995). In this respect, the clinical method has been contrasted 
with the administration of objective, standardized tests and the use of statistical 
or actuarial procedures for combining data through regression equations, multiple 
cutoffs, and other “mechanical formulas" (chap. 6; also see Dawes, Faust, & 
Meehl, 1993; L. R. Goldberg, 1991; Kleinmuntz, 1990; Wedding & Faust, 1989). 

The ecological viewpoint, which emphasizes the need to take the context of a 
person's life into account; has had a significant influence in assessment work, as it 
has in developmental psychology and related fields (see, e.g Moen, Elder, & 
Liischer, 1995). Similarly, heightened awareness of the role of culture in all be- 
havior—including the problems that lead people to seek help from mental health 
professionals—has stimulated a great deal of interest in information about and 
guidelines for culturally competent assessment practice (see, e.g., American Psy- 
chiatric Association, 1994, pp- xxiv and 843-849; Dana, 1993, 1996; Prediger, 


1993; Suzuki et al., 1996). 


39A representative sample of current methodological issues in psychologícal assessment research can be found in 
a special issue of Psychological Assessment (September 1995, Vol. 7, #3). 
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igence Tests in the Context of Individual Assessment. Tests such 
SC ‘Wechsler scales and the Stanford-Binet (see chap. 8) are fundamentally 
individual, clinical instruments. When an alert and trained clinician is in an 
contact with an examinee during the hour or so required to administer 3 test, i e 
or she can hardly fail to learn more about that person than is conveyed - ek Q 
or some other single score. This is still true if a test is administere ya 
technician, provided that a complete record of the test taker's responses is 

ined. : 1 

MA using intelligence tests to assess an individual's general level of intel- 
lectual functioning, clinicians also customarily explore the pattern, or profile, of 
test scores for significant strengths and weaknesses. Profile analysis provides data 
that may be of help in the diagnosis of brain damage and various forms of psy- 
chopathology that affect intellectual functioning differently. The Wechsler scales 
have lent themselves especially well to such profile analysis, since all subtest 
scores are expressed in directly comparable standard scores. From the outset, 
Wechsler described a number of diagnostic uses of his scales. Since then, several 
clinicians have recommended additional techniques, and profile analysis has been 
applied with other instruments as well (Delaney & Hopkins, 1987; Elliott, 1990b; 
Kaufman, 1990, 1994; Matarazzo, 1972; Sattler, 1988, 1992). Most profile analysis 
techniques employ variations of three major procedures. The first involves an 
evaluation of the amount of scatter, or extent of variation among the individual's 
various scores, including: Verbal and Performance IQ differences; overall subtest 
scatter; and comparisons of individual subtest scaled scores with the means of var- 
ious groupings of subtest scores, such as verbal or speeded subtests. The second 
procedure consists of analyzing the salient features of an individual's profile in 
light of base rate data about the frequency or rarity of such features within the nor- 
mative group. The third approach is based on score patterns associated with partic- 
ular clinical syndromes such as Alzheimer's disease, learning disabilities, or 
anxiety states, Wechsler and other investigators have described patterns of high 
and low subtest scores as well as subtest combinations characterizing these and 
other disorders (see, e.g., Kaufman, 1990; Matarazzo, 1972). 

Several decades of research on the various forms of pattern analysis with the 
Wechsler scales have provided little statistical support for their diagnostic 
value.?? In fact, critics of this approach have attacked almost every aspect of it at 
one point or another (F. C. Goldstein & Levin, 1985; Kavale & Forness, 1984; 
Macmann & Barnett, 1994a, 1994b; McDermott, Fantuzzo, & Glutting, 1990). 
Nevertheless, judging from the continued popularity of the Wechsler scales in 
clinical practice as well as from the enormous literature designed to systematize, 
facilitate, and improve the utilization of score pattern analysis, it is clear that this 
is still the preferred approach to interpreting intelligence test data. 

At a purely qualitative level, any irregularity of performance may also suggest 
avenues for further exploration. Significant leads can emerge from the form as 


#See Anastasi (19852) for a discussion of some of the methodological requirements that must be considered in 
evaluating this research. 
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well as from the content of test responses. Bizarreness, overelaboration, or exces- 
sive self-reference, for example, may indicate personality disorders. A qualitative 
analysis of both errors and correct responses may provide useful. clues about 
problem-solving approaches, conceptual development, or cognitive styles. Atypi- 
cal content of test responses is a further source of leads. Still another source of 
qualitative data available during the administration of an individual intelligence 
test is the general behavior of the examinee in the testing situation. Examples in- 
clude motor activities, speech, emotional responses, and attitude toward the ex- 
aminer as well as approach to the test materials and the testing environment. As 
a rule, because of their idiosyncratic nature, such qualitative cues cannot be vali- 
dated by quantitative methods adapted to the measurement of group trends. Nev- 
ertheless, in recognition of the critical role that behavioral observations can play, 
some instruments designed to systematize and quantify the recording and the in- 
terpretation of some test session behavior have been appearing (see, e.g the 
Guide to the Assessment of Test Session Behavior for the WISC-III and the 
WIAT by Glutting and Oakland, 1992). 

A clear illustration of the sophisticated clinical use of intelligence tests—com- 
bining psychometrie data with qualitative observations—is provided by Ala: S. 
Kaufman. In his books on "intelligent"—and intelligence— testing, Kaufman 
(1979, 1990, 1994) demonstrates in detail how the clinician can integrate statis- 
tical information about test scores with knowledge about human development, 
personality theory, and other areas of psychological research. Kaufman empha- 
sizes the importance of considering both the skills and the extraneous conditions 
that can influence subtest performance as well as the need for supplementary in- 
formation—from other tests, case history, and clinical observation of test behavior— 
against which to intepret score patterns. Test scores, along with other data 
sources, lead to the formulation of hypotheses about the individual, which can be 
tested as more information is gathered to round out the picture. The most impor- 
tant feature of the Kaufman method is that it calls for individualized interpreta- 
tions of test performance, rather than uniform application of any one type © 
pattern analysis: The same score pattern may lead to quite different interpreta- 
tions for different individuals. : 

The basic approach described by Kaufman undoubtedly represents a major 
contribution to: the clinical use of intelligence tests (see, e.g., Roecker, 1995). 
Even its critics acknowledge that it has become the method of choice for teaching 
intelligence testing and has guided the creation of much of the software available 
to aid in the interpretation of intelligence tests (McDermott, Fantuzzo, Glutting, 
Watkins; & Baggaley, 1992).4! The guidelines provided by Sattler (1988, 1992) 


are also excellent illustrations of the combined psychometric and clinical use of 
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*#For i tion abour various criticisms of Kaufman’s method, as well as for his counterarguments, see 
ir esc of Kaufman's approach is that they seem 


man (1994, chap.1); One roblem with several of the negative reviews o e 
to assume that GE Zi use it to make decisions based solely on the magnitude of scores and score differ- 


ences. While it is true that the mechanical application ol 
assumption is quite contrary tO what Kaufman recomme! 


practice (Moreland et al., 1995). 


f profile analysis techniques can be very misleading, this 
nds, as well as to the principles of sound assessment 
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individual intelligence tests. It should be recognized, however, that the ER 
mentation of these approaches requires a sophisticated clinician who is well in- 
formed in several fields of psychology and not pressed for time. Moreover, the 
availability of computer-based interpretive programs—while in some ways facili- 
tating the application of these methods—can easily lead the hurried or less 
knowledgeable examiner to overinterpret the data the programs provide. 

In the meantime, new procedures for the utilization of intelligence test profiles 
are being developed, some of which combine elements from both the psychomet- 
ric and clinical perspectives. One interesting innovation being explored is the 
“core profile” approach. McDermott, Glutting, and their coworkers (Glutting, 
McDermott, Prifitera, & McGrath, 1994, 1995; McDermott, Glutting, Jones, & 
Noonan, 1989), as well as Donders (1996), have applied cluster analyses—of var- 
ious types—to the standardization data from the Wechsler scales. The goal of 
these analyses is to generate basic profile types that might help in classifying test 
results and in testing hypotheses about their clinical significance. A complemen- 
tary line of investigation uses multidimensional scaling to identify prototypical 
ability profiles in the population (Davison, Gasser, & Ding, 1996). This research 
is aimed at eventually quantifying the degree of correspondence between an indi- 
vidual’s observed profile and the prototypical profiles that emerge from a given test 
battery. Although these investigations are intriguing, as of now, they are still ex- 
ploratory and do not yet have clinical utility (see, e.g., Ryan & Bohac, 1994). 


Neuropsychological Assessment 


Methodological Problems in the Diagnosis of Brain Damage. Knowl- 
edge about the behavioral effects of brain damage dates largely from the writings 
of Kurt Goldstein and his associates in the early 1920s (Goldstein & Scheerer, 
1941). Following extensive observations of soldiers who had sustained brain 
injuries during World War I, Goldstein formulated his classic description of the 
intellectual impairment associated with brain damage. Among the principal 
symptoms were a diminution in the ability for abstract thought and a tendency to 
respond to extraneous stimuli that may disrupt normal perception. 

Widespread concern with brain injury in children arose in the late 1930s and 
in the 1940s, following the research of Alfred Strauss and his associates (Strauss 
& Lehtinen, 1947; H. Werner & Strauss, 1941, 1943). These investigators identi- 
fied a group of mentally retarded children whose case histories showed evidence 
of brain injury due to trauma or infection occurring before, during, or shortly after 
birth. The behavioral description of these children represented an extension and 
elaboration of the adult syndrome formulated by Goldstein. It delineated a dis- 
tinctive pattern of both intellectual and emotional disorders that was widely ac- 
cepted as characteristic of “the” brain-injured child. Included in this pattern were 
specific perceptual and conceptual disorders combined with relatively high verbal 
ability, as well as overactivity, distractibility, and aggressiveness. For many years, 
both research and practice with brain-injured children was dominated by a uni- 
dimensional concept of “organicity.” This approach led to a search for diagnostic 
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tests of organic involvement as such and an attempt to devise remedial or educa- 
tional programs suitable for brain-injured children as a whole. 

Since the 1950s, psychologists have increasingly recognized that brain damage 
may lead to a wide variety of behavioral patterns, a recognition that has spurred 
the development of clinical neuropsychology, a field that aims to apply what is 
known about brain-behavior relationships in the diagnosis and rehabilitation of 
brain-damaged individuals. No one symptom or set of symptoms need be common 
to all brain-injury cases. In fact, brain damage may produce the opposite behavior 
pattérn in two individuals; Such findings are consistent with the wide diversity of 
the underlying organic pathology itself. A significant advance in the analysis of 
brain-behavior relationships was made by the research of Ralph Reitan and his 
co-workers at the Indiana University Medical Center (see Matarazzo, 1972, chap. 
13; Reitan, 1955, 1966). These investigations showed that left-hemisphere le- 
sions tend to be associated with lower Verbal than Performance IQ on the Wech- 
sler scales (V € P). The opposite pattern (V > P) predominates in groups with 
lesions in the right hemisphere and with diffuse brain damage. 

Neuropsychological research has continued to sort out the complex interac- 
tions of other variables with the behavioral effects of brain pathology (see, e.g., 
Kolb & Whishaw, 1990). There is a good deal of evidence indicating that age af- 
fects the behavioral symptoms resulting from brain damage. The behavioral ef- 
fects will also depend upon the amount of learning and intellectual development 
that has occurred prior to the injury. Research on preschool children, for exam- 
ple, indicates that at this age level brain lesions tend to affect intellectual func- 
tions more massively than they do later on: 

Chronicity has alen been found to affect test performance and to interact with 
age effects. Available data suggest that the amount of time elapsed since the in- 
jury may be related not only to progressive physiological changes but also to the 
extent of behavioral recovery through learning or compensatory readjustments. 
Finally, it should be noted that in some cases intellectual impairment may be the 
indirect result of brain damage. Throughout the individual’s development, organic 
and experiential factors interact. Some of the behavioral difficulties included in 
the classic picture of brain-injured children, for example, may be an indirect ef- 
fect of the frustrations and interpersonal problems experienced by a child with an 
organically caused intellectual deficiency. Whether or not these behavioral diffi- 
culties persist may therefore depend on the attitudes and the degree of under- 
standing exhibited by parents, teachers, and other significant persons in the 
child’s environment. i m 

It is abundantly evident that brain damage covers a wide variety of organic dis- 
orders, with correspondingly diverse behavioral manifestations. The test perfor- 
mance of brain-injured persons can be expected to vary with the source, extent, 
and locus of the cerebral damage; the age at which the damage occurred; the age 
when the individual's behavior is being assessed; the duration of the pathological 
condition; and the extent and types of intervention the individual has received. 
To expect behavioral homogeneity among those affected by brain pathologies 


would thus be highly unrealistic. 
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From another angle, the same intellectual or behavioral disorder—and the 
same diagnostic sign in test performance—may result from organic, emotional, or 
mixed etiologies. A conspicuous example is persistent forgetfulness. Memory diffi- 
culties of various sorts can be a symptom of one of many types of dementias, with 
known organic causes, or of depressive disorders of emotional origin; in addition, 
the onset of an organic memory disorder is frequently accompanied by depression, 
presenting a mixed picture. The assessment of memory problems and the differen- 
tiation between dementia and depression ir. older adults, in particular, are among 
the most common referral questions in clinical neuropsychology (Butters, ‘Delis, 
& Lucas, 1995; Poon, 1986; Reeves.& Wedding, 1994; Storandt & VandenBos, 
1994). Experiential factors, which may be unrelated to the brain damage in some 
cases and may be related to it—more or less directly—in others, can complicate 
the diagnostic picture even further. Thus, the interpretation of any specific diag- 
nostie sign in test performance requires additional information about the individ- 
ual's experiential background and personal history. For instance, one important 
piece of information needed in order to gauge the magnitude of cognitive 
deficits—as well as the extent of recovery of intellectual functions—is the per- 
son's premorbid level of ability (see, e.g., Matarazzo, 1990). Educational attain- 
ment is frequently used as a rough indicator of premorbid functioning, but 
additional estimation methods based on historical data and post-traumatic test 
performance have also been devised (Vanderploeg, 1994b). 

Altogether, the practice of neuropsychological assessment is one of the most 
demanding clinical tasks, requiring—as it does—the application of knowledge 
about cognitive, personality, neurological, and general physiological functioning 
in both the normal and pathological ranges. It is, therefore, not surprising that 
the research literature in this field and practical handbooks designed to guide stu- 
dents and practitioners continue to accumulate at a prodigious rate. One exten- 
sive compendium of research and clinical information is in the multivolume 
Handbook of Neuropsychology, edited by Boller and Grafman (1988—1995). Addi- 
tional reference works on the general practice of neuropsychological assessment 
and intervention have been prepared recently by Adams, Parsons, Culbertson, 
and Nixon (1996), Golden, Zillmer, and Spiers (1992), Lezak (1995), Touyz, 
Byrne, and Gilandas (1994), Vanderploeg (1994a), and R. E White (1992). More 
specialized treatises on such topics as forensic neuropsychology (Valciukas, 1995), 
neuropsychological assessment of occupational neurotoxic exposure (Agnew & 
Masten, 1994), and neuropsychological evaluation of Spanish speakers (Ardila, 
ae & Puente, 1994), among many others, continue to appear at a rapid 


ee Instruments. A large number of tests have been special- 
as clinical instruments for assessing neuropsychological impairment? 
(see, ¢.g., Lezak, 1995; Spreen & Strauss, 1991). These tests are often designated 


WI th 
iem nor Af are regularly reviewed in the Mental Measurements Yearbooks (see TIP-IV, p. 1116, for a 


Major Contexts of Current Test Use 53] 


as indicators of “organicity” or brain damage. Chief among the functions assessed 
by these instruments are- those- considered most. sensitive» to pathological 
processes, such as perception of spatial relations and memory for newly learned 
material. The Bender Visual Motor Gestalt Test, commonly known. as the 
Bender-Gestalt Test. (BGT—Bender, 1938; Canter, 1996; Heaton, Baade, & 
Johnson, 1978; Koppitz, 1964, 1975; Pascal & Suttell, 1951), and the Benton 
Visual Retention. Test—Fifth. Edition (BVRT—Sivan, 1991) exemplify these 
types of instruments and have been used as screening tests for many decades. 
Because of the wide diversity of organic. brain dysfunctions and. accompanying 
behavioral deficits, however, no one: test is adequate for screening for brain 
damage in general; single tests like the BGT and BVRT are even less suited for 
differential diagnosis: i 

Clinical neuropsychologists frequently use a combination of available tests as- 
sessing different skills and deficits, ina fashion that has been labeled as the "flexi- 
ble battery" approach (see, e.g., Bauer, 1994; Goodglass, 1986). This procedure 
has the advantage of providing combinations of tests that are tailored to the prè- 
senting problems in individual cases. Put it also has several limitations. There is 
likely to be some unnecessary duplication of functions among tests, and some crit- 
ical areas may be overlooked. The advance selection of tests appropriate for each 
case puts a heavy burden on the clinician’s expertise and judgment. Moreover, in- 
dependently developed tests are not likely to be comparable with regard to norms 
and score scales. Empirical data on the interrelations of different tests are also 
likely to be meager. Consequently, it is difficult to interpret results in. terms of 
score patterns. gis ngu Ae e j 

For these reasons, systematic efforts have been made to assemble comprehen- 
sive standardized batteries that provide measures of all significant neuropsycho- 
logical skills, Such a:batrery. can. serve several functions. It can. detect brain 
damage with a high degree of success. It can also help to identify and localize the 
impaired brain areas. It:can differentiate among, particular syndromes associated 
with cerebral pathology. And it can help in planning rehabilitation training by 
revealing the specific type and extent of behavioral deficits. Two major examples 
hological batteries are the Halstead-Reitan. Neu- 
ropsychological Test Battery (HRB—Reitan & Wolfson, 1993) and the pun 
Nebraska Neuropsychological Battery (LNNB—Golden, Purisch, & Hammeke, 
1985). Both of these batteries have versions for use with children and share a 
common purpose, but they differ im several important ways. The HRB, which is 
the older of the two, was developed by Reitan from the work of Halstead (1947) 
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Luria, 1973, 1980); it is more highly standardized in content, materials, adminis- 
tration, and scoring and requires considerably less time to administer than the 
HRB (for reviews, see J. H. Snow, 1992; Van Gorp, 1992). 

Recent developments in the direct assessment of brain impairment through 
electroencephalography and neuroimaging techniques, such as magnetic reso- 
nance imaging (MRI) and positron emission tomography (PET), are having a 
profound impact on basic and clinical neuropsychology. Although available tech- 
nologies are continually being improved, no diagnostic technique is 10096 de- 
pendable. In most cases, neuropsychologists work with neurologists and other 
specialists to obtain corroborative information from various sources. In the clini- 
cal context, carefully standardized behavioral measures serve an important func- 
tion in assessment and in the planning and monitoring of rehabilitation programs, 
along with other procedures. And, in the realm of basic science, the integration 
of neuropsychological and neuroimaging methodologies offers some of the great- 
est promise in advancing knowledge about brain-behavior relationships (see, e.g., 
Gur & Gur, 1991, 1994). 


Identifying Specific Learning Disabilities 


Since the 1970s, there has been a continuous wave of programs for the diagnosis 
and remediation of learning disabilities (LD). Educators have become increas- 
ingly aware of the high frequency of this type of handicap among schoolchildren, 
and even among college students and other adults (see, e.g., Gregg, Hoy, & Gay, 
1996; Kravets & Wax, 1992; Wang, Reynolds, & Walberg, 1991), although the 
large number of individuals placed in this category may reflect in part misclassifi- 
cation resulting from the looseness with which the term has been applied. The 
varied terminology applied to these cases reflects both changing approaches over 
time and differences between the medical, educational, and psychological orien- 
tations to LD (see, e.g., American Psychiatric Association, 1980, 1994). Accord- 
ing to the definition set forth by the federal government in Public Law 94-142, 
the legislation which provides for the education of children with various disabili- 
ties (see chap. 9), a specific learning disability is described as 


;. -a disorder in one or more of the basic psychological processes involved in under- 
standing or in using language, spoken or written, which may manifest itself in an 
imperfect ability to listen, think, speak, read, write, spell, or do mathematical calcu- 
lations. The term includes such conditions as perceptual handicaps, brain injury, 
minimal brain dysfunction, dyslexia, and developmental aphasia. The term does not 
include children who have learning problems which are primarily the result of vi- 
sual, hearing, or motor handicaps, of mental retardation, of emotional disturbance, 
op E E cultural, or economic disadvantage. (Federal Register, 1977, 
p. 


PL. 94-142 further specifies that the diagnosis of learning disability should be 
applied only to children who (1) show a "severe discrepancy" between intellec- 
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tual ability and achievement in various communication and mathematical skills 
and (2) are not achieving in a manner commensurate with their age and ability 
levels, even when provided with appropriate education.** It follows from this defi- 
nition that the designation of learning disability should not be employed until sev- 
etal other conditions have been eliminated as possible causes of the child's 
educational or psychological difficulties. Over the years, it has been recognized in- 
creasingly that the heterogeneous LD population comprises subgroups that can be 
differentiated in terms of symptom clusters (Rourke, 1990; Feagans, Short, & 
Meltzer, 1991; Geary, 1993; S. R. Hooper & Willis, 1989; Pennington, 1991; 
Shankweiler etal; 1995; H. L: Swanson & Keogh, 1990). Nevertheless, there are 
still wide differences in theoretical orientation. even among specialists on LD. 
These differences are reflected in both testing instruments and remedial programs. 

Typically, children with learning disabilities show normal or above-normal in- 
telligence, in combination with pronounced difficulties in learning one or more 
basic educational skills (most often, reading). It should be noted, however, that 
specific learning disabilities may occur at any intellectual level, even though 
mentally retarded children with LD do not meet the legal definition of learning 
disabled. LD children also manifest various combinations of associated behavioral 
symptoms. Chief among such symptoms are difficulties in perceiving and encod- 
ing information, poor integration of input from different sense modalities, and 
disruption of sensorimotor coordination. Disturbances of language development 
typically occur in LD children. Deficiencies in memory, attention control, and 
conceptual skills are common, as are certain emotional and motivational symp- 
toms. In particular, aggression and other affective and interpersonal problems may 
develop, often as a direct result of the academic failures and frustrations engen- 
dered by the child’s learning disabilities. In evaluating the child’s behavior, one 
should bear in mind that many specific difficulties that are normal at an early age 
(e.g., ina three-year-old) represent a dysfunction if they persist to an older age. A 
developmental frame of reference is therefore needed, with qualitative if not 


quantitative norms. 


Assessment Techniques. Regardless of theoretical orientation, there is 
general agreement that the identification of learning disabilities requires a wide 
assortment of tests and supplementary observational procedures. This follows 
from at least three features of the diagnostic problem: (1) the variety of 
behavioral disorders associated with this condition, (2) individual differences in 
the particular combination of symptoms, and (3) the need for highly specific 
information regarding the nature and extent of the disabilities in each case. 
Usually, the assessment of children with learning disabilities is a cooperative ef- 
fort by a professional team. The classroom teacher can administer group tests and 
employ other screening or wide-band instruments. Regular achievement batteries 
may also serve in this connection, particularly those designed for primary and 
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preschool levels and permitting "criterion referenced" (ie, content-referenced) 
analyses of specific strengths and weaknesses Several instruments cited in the edu- 
cational testing section of this chapter are appropriate for this purpose. 
Individually administered, wide-range achievement tests are especially useful 
for the assessment of learning disabilities. These tests may be given by teachers, 
although the supplementary qualitative observations and score interpretations 
can be handled more effectively by a clinician. One of the instruments in this cat- 
egory is the Kaufman Test of Educational Achievement (K-TEA—Kaufman & 
Kaufman, 1985). The administration of this test, with the pervasive easel-type 
kit, is shown in Figure 17-4. A few batteries include some subtests classified as 
measures of aptitudes or cognitive ability and some classified as measures of 
schievement. Among the most comprehensive of these is the Woodcock-Johnson 
Psycho-Educational Battery—Revised (W]-R—McGrew, 1994; McGrew, Werder, 
& Woodcock, 1991). Such rests as the Stanford-Binet, the K-ABC, and the 
Wechsler scales not only provide a global index, such as an IQ, to help differenti- 
ate between mental retardation and specific learning disabilities, but they also 
yield much qualitative information regarding specific deficiencies (Kaufman, 
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concepts, 
of language disorders. The publishers of the Wechsler scales have also developed 
the Wechsler Individual Achievement Test (WIAT — Psychological Corporation, 
1992). a comprehensive battery that is linked to the intelligence scales and in- 
tended to complement them in the assessment of learning tes 0 
Although existing tests can be helpful in the implementation of current guide- 
lines for the assessment of learning disabilities, many investigators have repeat- 
edly called for a new and more informative approach to the diagnosis and 
evaluation of LD children. These calls center primarily on the need for a clear 
theoretical rationale to guide assessment and for a deeper understanding 
of the specific processes involved in’ “case, in order to improve both assess- 
ment and remediation efforts (see, GE: Das, Naglieri, & Kirby, 1994; R. B. Kline, 
Snyder, & Castellanos, 1996; C. R. Reynolds, 199b) '"'' * 


Dynamic Assessment, The term "dynamic assessment” covers a variety of 
clinical procedures that involve essentially the deliberate from stan- 
dardized or uniform test administration in order to elicit qualitative 
data about an individual. Although skilled clinicians have been employing 
procedures for some time, this approach has | in popularity since the 

(Lidz, 1981, 1987, Ver? 1995), It has been used as à source of 
data, not only in cases o learning disabilities. : 

who have been experiencing difficulties with their schooling, suchas siose with 


mild or prend mental retardation. sa a? Le 


nr 


| 


assessment of gines Tred (aie, Dol 
is en tenta ep, Bolig & 
"Orie of the earliest of these tative adaptations of testing procedure is 
known as “testing the limits.” In^ procedure, additional cues m be provided 
by the examiner; eins” e e 
greater the learning disability. The alterations of standard procedure Geesen: 
ing the he lare e to some of the special adaptations in testing. 
prie e: eer EH s t aeaa a 
tions apply to the Sim SS NES 

A more recently ‘approach has been. learning. potential 
assessment (Babad degen 
Glutting & McDermott, 1 Hamer, 


3 
3 
l 
E 


1990; 
1991). The term "porential" in 
pir the ability lorc gend 
ered.” In effect, however, 
which In ffec bowie by various means to perform a task thar he or e Pb 
initially unable to carry out. Superficially, this technique is similar to - 
lowed y unable ehe pratt educational rents in which test rakers ie e 
sample task requiring the sott of leaming they will encounter in a particular 
course ofatudy, The present application differs from char of prognostic tests in ar 


always 
procedures involve a test-teach-test format in 


$36 APPLICATIONS OF TESTING .. 


least two. ways: (1) che test taker is given instructions or special suggestions, and 
Ki - tasks usually involve more broadly applicable learning or problem-solving 

i vm 

The dynamic assessment techniques pioneered by Feuerstein and others offer 
promise in several directions. By linking assessment and instruction, they pro- 
mote research on the modifiability of scholastic aptitude and on the development 
of optimal remediation programs. In addition, they provide the qualified clinician 
an assessment tool that can produce more illuminating descriptions of cognitive 
performance and of its responsiveness to remedial interventions than standard»: 
ized intelligence tests. In spite of these advantages, dynamic techniques have not 
escaped criticism. One question concems their transportability, or the extent to 
which they can be used effectively by different clinicians, Another pertains to the 
generalizability of the remedial effects achieved with these very broad tasks (usu- 
ally nonverbal reasoning problems similar to the Raven's Progressive Matrices or 
to the Weschler Performance scale items) to real-life school performance. Fur- 
thermore, although most proponents of dynamic assessment techniques would 
like to get away from the reliance on scores that is typical of "static" cognitive 
measures (such as intelligence tests), the problem of how to document change 
without using number—in some fashion—has proven ro be intractable 
(A. L. Brown et al., 1992; R. E. Snow, 1990), Nevertheless, the exploration of the 
potential usefulness of dynamic assessment techniques continues apace. Some of 
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Behavioral Assessment 


The various techniques subsumed under the general concept of behavior modifi- 
cation represent a direct utilization of major learning principles in the practical 
management of behavior change. Basically, these techniques involve the opel: 


tion of conditioning principles to the acquisition and strengthening of wanted 
havior and the elimination of aia ocak t, Behavior therapy has gradually 
broadened to comprise an expanding variety of psychological problems, an ever- 
expanding repertoire of intervention techniques, and the consideration of cogni- 
tive and affective—along with overt motor—responses. (Bandura, 1969, 1986; 
Hersen et al., 1991; Lazarus, 1981). ; i i 


Assessment, Techniques. In the early applications of behavior therapy, 
assessment received little attention. Since the mid-1970s, however, the impor- 
tance of assessment procedures has been increasingly recognized (Barrios, 1988; 
A. S. Bellack & Hersen, 1988; Haynes, 1991; & Terdal, 1988; Nelson & 
Hayes, 1986, O'Brien & Haynes, 1993; Ollendick & Hersen, 1993). The 
principal functions to be served by assessment procedures in behavior therapy can 
be subsumed under three headings, Pioaan techniques help in g 
the individuals problem through a functional analysis of relevant behavior. 
Essentially, such an analysis involves a full spec tion of the treatment 
objective—such as overcoming a phobia or obsessive thoughts—and includes a 
description of the stimuli that elicit the rarget behavior, the situations in which 
such behavior occurs, and the nature, magnitude, frequenc! 
responses, A second way in which assessment procedures can guide the behavior 
therapist is in selecting appropriate treatments. Third, there is a need for the 
behavior change that results from, treatment, Such assessments should include 
techniques for monitoring change—so as to permit evaluation of treatment 
effectiveness and the introduction of procedural alterations if necessary—as well 
as terminal measures to establish attainment of a satisfactory status and to plan for 
follow-up procedures as needed. 

In considering specific assessment procedures, we should note, first, that the 
same procedure can often provide information relevant to all three functions. 
Second, the choice of procedures depends on the nature of the problem, the char- 
acteristics of the client—including the environment in which he or she must 
operate—and the facilities available at the particular clinic. Third, a combina- 
tion of several assessment procedures is desirable in most cases. 

Available assessment procedures can themselves be classified under three 
major types: self-report by the client, direct observation of behavior, and physio- 
logical measures. Although not every center can afford the facilities required for 
physiological measures, these measures provide supplementary objective data in the 
assessment of certain conditions, such. as anxiety, sexual arousal disorders, and 
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sleep disorders (Sturgis & Gramling, 1988). Examples include measures of elec- 
trodermal, muscular, and electro-ocular activity, as well as of cardiovascular, sex- 
ual, and cerebral functioning. 

Direct observation of the target behavior can be conducted in naturalistic situa- 
tions by parents, teachers, institutional personnel, or special observers. Observa- 
tional aids such as checklists, rating scales, and daily schedules may be employed. 
Such observations are subject to several weaknesses (Barrios, 1993; also see chap. 
16). For this reason, analogue situations are often employed in clinics. There is 
also increasing exploration of the use of mechanical sensors or monitors that can 
provide a continuous, objective record of behavior in both real-life and contrived 
situations (W. W. Tryon, 1985, 1991). 

Self-report by the client comprises a diversity of techniques. They include clini- 
cal interviews by the therapist, self-monitoring records of target behavior and as- 
sociated conditions kept by the client, and a variety of written checklists and 
inventories. Some standardized self-report inventories are used, in either the orig- 
inal or adapted versions, for preliminary screening and identification, as well as 
for monitoring of target behaviors. Among the simplest and most widely used of 

these is the revised Beck Depression Inventory (BDI—Beck & Steer, 1993), a 21- 
item instrument designed to assess the severity of depression through self- 
ratings.*° One of the newest standardized measures is the Alcohol Use Inventory 
(Horn, Wanberg, & Foster, 1990), a 228-item self-report inventory that assesses 
involvement with the use of alcohol in a conceptually and psychometrically so- 
phisticated fashion, involving multiple scales at different levels of generality (for 
reviews, see Drummond, 1995; McNeely, 1995). Many other instruments have 
been prepared for use in specific research projects or treatment programs, and, al- 
though some of them have not themselves been published, they are usually fully 
described and reproduced in journal articles or books (see, e.g., A. S. Bellack & 
Hersen, 1988). 

Recently, other instruments have been developed which comprise objective 
rating scales to be completed by multiple informants. One of the most compre- 
hensive of these is the Behavior Assessment System for Children (BASC) by 
Cecil Reynolds and Randy Kamphaus (1992), already mentioned in chapter 16. 
This system includes behavior rating scales for parents and teachers and a form for 
coding and recording direct observations of classroom behavior; in addition, the 
BASC provides a self-report questionnaire for the children themselves and a 
structured interview schedule through which parents can provide a developmen- 
tal history. The Social Skills Rating System (SSRS—Gresham & Elliott, 1990) 
also provides forms for parents, teachers, and students themselves to evaluate the 
positive and problematic behaviors of students in educational and family settings. 
One of the particularly helpful features of the SSRS, according to reviewers, is 
that it provides a component that permits the linkage of the assessment results 


with the planning of intervention strategies (f i ; Fur- 
Spa aig gies (for reviews, see Benes, 1995; Fur. 
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Career Assessment 


The practice of career assessment involves helping an individual discern the most 
appropriate career choice, in light of the person's abilities, interests, goals, values, 
and temperament, as well as the requirements of a given occupation. Few areas of 
life are as important to people as their occupations, not only because of the great 
amount of time most persons spend at work but also because work can usually pro- 
vide opportunities for many intrinsic and extrinsic rewards (Super & Sverko, 
1995). In addition, the current climate of rapid changes in the nature and condi- 
tions of employment is prompting many more people to consider their career 
choices not just once but often several times within their lifetimes. Thus, it is not 
surprising that theorizing in the field of career choice and career development is 
proceeding at a very lively pace. Already in the 1990s, the number of major new 
theoretical formulations in the area has equaled or exceeded those produced since 
the 1950s when both Donald Super (1953) and John Holland (1959) made their 
first significant contributions to the field.*” 

From the point of view of testing, we have already considered the instruments 
that are most directly applicable in the career counseling of individuals, 
namely—the interést inventories discussed in chapter 14 and the multiple- 
aptitude batteries described in chapter 10, as well as in the occupational testing 
section of this chapter. Choosing a career often implies the choice of a general 
lifestyle, with its characteristic set of values. Because interest inventories essen- 
tially assess the individual's value system, they are coming more and more to be 
regarded as focal in effective career planning. In this section, we examine two 
more specialized kinds of instruments that have been particularly designed for ca- 
reer counseling: comprehensive programs for career exploration and measures of 
career maturity. Many more instruments are described and reviewed in the indis- 
pensable guide to career assessment instruments of all types prepared by Kapes, 
Mastie, and Whitfield (1994), now in its third edition. 


Comprehensive Programs for Career Exploration. Several multiple 
aptitude batteries have been incorporated into career guidance systems. An 
example is provided by the Differential Aptitude Tests (DAT), described in 
chapter 10, which can be used in combination with the Career Interest Inventory 
(CII— Psychological Corporation, 19912, 1991b). These two instruments Were 
developed jointly in order to facilitate comparisons between their results in the 
process of career guidance. 
Another example is the program developed by the U.S. Employment Service, 
whose General Aptitude Test Battery (GATB) was discussed in the occupational 
testing section of this chapter. Among the most useful tools that resulted from the 
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USES career counseling program are the Complete Guide for Occupational Explo- 
ration (CGOE—Farr, 1992) and the Enhanced Guide for Occupational Exploration 
(EGOE—Maze & Mayall, 1995). Intended for use by counselors as well as by stu- 
dents and job seekers themselves, these guides group thousands of occupations in 
the world of work by major interest areas and by ability patterns and other re- 
quirements for successful performance. The individual can use the guides for pre- 
liminary career exploration by identifying those work groups in which he or she 
has a strong interest and then checking the training and skills they require. The 
CGOE lists all 12,741 occupations listed in the Dictionary of Occupational Titles 
(U.S. Department of Labor, 1991), whereas the EGOE lists only 2,800—compris- 
ing 95 percent of the workforce—but provides more information on each. 

A newer approach to career counseling provides a procedure for integrating 
available information from many sources into a comprehensive career exploration 
program. The information may include scores from a variety of tests (each with its 
own normative and interpretive data), biographical data (including education 
and job experience), and the individual’s expressed interests, preferences, and 
value system. This approach is illustrated in varying degrees by several available 
instruments, such as the Harrington-O’Shea Career Decision-Making System— 
Revised (Harrington & O'Shea, 1993) and the American College Testing (ACT) 
Career Planning Program (ACT, 1994). 

An outstanding example of such integrative career-exploration programs is the 
revised version of the System for Interactive Guidance Information (SIGI- 
PLUS), cited in chapter 3. Using an interactive program, SIGI-PLUS enables the 
individual to conduct two-way communication with the computer—asking and 
answering questions, providing data, and soliciting information. An extensive 
database on job characteristics and requirements is included in the program, with 
provision for incorporating additional local data. Originally designed for use with 
college and university students, SIGI-PLUS has been updated for use with adults 
wishing to make career changes or enter the job market at different life stages. 
The program is so designed as to guide the individual in examining relevant facts 
and proceeding along a systematic route toward effective decision making (M. R. 
Katz, 1993; Norris, Schott, Shatkin, & Bennett, 1986). Even such a well-de- 
signed computer-administration system, however, cannot perform the whole ca- 
reer-decision function for the individual. The intervention of a trained counselor 
may be needed to encourage the individual to think through her or his particular 


ie and traits at different stages of career development (see, e.g., Tiedeman, 
4). 


Assessment of Career Maturity, Another type of instrument specifically 
developed for use in career counseling is concerned with the individual's level of 
career maturity. This concept emerged from a long-term research project on 
career development conducted by Super and his associates (Super et al., 1957; 
Super & Overstreet, 1960). Career maturity refers to individuals’ mastery of the 
vocational tasks appropriate to their age level, and their effectiveness in coping 
with such tasks. The research itself comprised a 20-year longitudinal investigation 
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of approximately 100 ninth-grade boys. The findings suggested that the major 
career development task at the junior high school level is that of preparing. to 
make career choices. Other research on career. development, by both cross- 
sectional and longitudinal approaches, has been contributing data that help fill 
out the picture (see; e:g.; Grites, 1969; Gribbons & Lohnes, 1982; Super, 1980, 
1985). A. major finding of the more recent research is that all types of career 
behavior may occur throughout most of the life span, even though.one type 
predominates at each. life stage. Increasing attention is being given to career 
changes in adult life; these changes stem from a variety of factors both in one’s 
own personal life and in the nature of occupations (Kummerow, 1991; Lowman, 
1991, 1993; Pickman, 1994; Walsh & Osipow, 1993). 

One of the by-products of the research on career development has been the 
construction of standardized measures of career maturity (Kapes et al., 1994, pp. 
241-272). An example is the Career Development Inventory prepared by Super 
and designed to assess readiness to decide on a career and to highlight any aspects 
of career orientation in which individuals need assistance (A. S. Thompson & 
Lindeman, 1981, 1984). A related type of instrument is the Career Beliefs Inven- 
tory (Krumboltz, 1991), which was developed as an aid to career counseling with 
the specific goal of identifying any beliefs that may be blocking an individual's at- 
tainment of career goals. 


Clinical Judgment 


What the clinician does in assessing a client may be regarded as a special case of 
person cognition, or interpersonal perception—the process through which any- 
one comes to know and understand another person (Kruglanski, 1989). In the 
clinical situation; however, judgmental accuracy differs from ordinary social cir- 
cumstances in several significant ways: Much has been written about the data- 
processing, synthesizing, or interpretive role of the clinician. Research on the 
process of clinical judgment has thrown considerable light on some of the possible 
sources of error in this process, such as the influence of cultural stereotypes and 
reliance on fallacious prediction principles. Examples of the latter include failure 
to consider base rates or regression effects, and the assumption. that more highly 
intercorrelated predictors yield higher validity (see, e-g., L- R. Goldberg, 1991). 
Given the same set of facts, such as test scores or life-history data, will clinical 
judgment provide more accurate predictions of subsequent behavior than would 
be obtained by the routine application of a regression equation or other empiri- 
cally derived formula? The question has practical as well as theoretical impor- 
tance, because once a statistical strategy—or algorithm—has been developed, it 
can be applied br a clerk or a computer, thus freeing the clinician for other func- 
tions, In a classic book entitled Clinical Versus Statistical Prediction, Meehl (1954) 
discussed the process of clinical judgment and surveyed investigations comparing 
the two types of prediction. Meehl showed that, with only one questionable ex- 
ception, the routine application of statistical procedures yielded at least as many 
correct predictions as did clinical analysis, and frequently more. The publication 


542 APPLICATIONS OF TESTING 


of Meehl's book, and subsequent reaffirmations of his conclusion, stimulated 
a lively controversy that is still active (see, e.g., Anastasi, 1988b, pp. 511—515; 
Dawes, Faust, & Meehl, 1993; Kleinmuntz, 1990). 

In spite of the apparent superiority of the actuarial approach, when clinical and 
statistical procedures are applied to the same data, it is important to bear in mind 
that the clinical process offers certain advantages. A major contribution of the 
clinical method, for example, is that data are obtained in areas where satisfactory 
tests are unavailable, through careful interviewing and observations of behavior. 
The clinical method is also better suited than the statistical method to the pro- 
cessing of rare and idiosyncratic events whose frequency is too low to permit the 
development of statistical strategies. 

In summary, the most effective procedure in our present state of knowledge 
combines clinical and statistical approaches (Matarazzo, 1990). The clinician 
should use all the objective test data and actuarial strategies applicable to the par- 
ticular situation, while supplementing this information with facts and inferences 
attainable only through clinical methods. The validity of clinical predictions 
against actual outcomes should be investigated systematically whenever feasi- 
ble.*5 More data are also needed on the consistency of predictions about the same 
persons made by different clinicians and by the same clinicians at different times. 
Insofar as possible, the process and cues on which clinical predictions are based 
should be made explicit in the clinical records. Such a practice would not only fa- 
cilitate research and training but would also serve to encourage reliance on sound 
data and defensible interpretations. Finally, the “clinician as instrument" is an 
important concept. This is illustrated in ongoing research on characteristics that 
moderate clinical judgment accuracy. "Through a series of ongoing investigations, 
Spengler and his colleagues have concluded that psychologists with lower levels 
of cognitive complexity are more likely to form biased clinical judgments than 


those with higher cognitive complexity (Spengler & Strohmer, 1994; Walker & 
Spengler, 1995). 


The Assessment Report: A Final Synthesis. In chapter 18 we will 
examine some of the broad issues involved in the communication of test results, 
with particular reference to ethical and social implications. For the clinician, 


Some large-scale projects of this nature, aimed at evaluating the validity of decisions or icti 

assessment procedures, are currently veing planned. One such Beda aedibus Curtii 
pices of the Society for Personality Assessment, is a complete literature review—to be followed by meta- 
analyses—of studies that have used personality assessment procedures or instruments to make predictions about 
aspects of therapy or about medical, legal, or life outcomes (Handler & Meyer, 1996, Spring/Summer). 
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In its content, the report should draw upon all the data sources (test and nontest) 
available to the clinician. 

Several books provide guidelines for report writing. Without duplicating the 
many lists of suggestions that can be found in such sources, we shall focus on some 
of the major points. First, there is no one standard form or outline for all reports. 
Both content and style should and do vary with the purpose of the assessment, the 
context in which it is conducted, the persons to whom the report is addressed, 
and the theoretical orientation and professional background of the clinician. It is 
especially important to adapt the report to the needs, interests, and background of 
those who will receive it. For example, a report addressed to a lawyer needs to be 
quite different from oné addressed to a psychotherapist. Nevertheless, for both of 
them, the clinician should select what is relevant to answering the questions 
raised at the outset from the mass of data he or she has gathered. 

The report also should concentrate on each individual's differentiating charac- 
teristics—the high and low points—rather than on traits in which the individ- 
ual’s standing is close to the average. A test of the effectiveness of a report is to see 
whether it is unique to the individual or whether it applies equally well to other 
persons. It is a relatively easy task to prepare a pseudo-report from general, stereo- 
typed statements that apply to most people. A considerable body of research has 
demonstrated that such reports are readily accepted as “remarkably accurate” self- 
descriptions by a large majority of persons (Goodyear, 1990; Klopfer, 1983; Sny- 
der & Larson, 1972; Tallent, 1992, pp. 236-238). This pseudo-validation has 
been called the “Barnum effect,” after Phineas T. Barnum, the famous showman 
who is credited with the remark that "there's a sucker born every minute.” Re- 
liance on such generally applicable personality descriptions is a favorite device of 
fortune tellers and other charlatans. ? : 

The primary focus of the report should be on interpretations and conclusions, 
although test records and other detailed data may be separately appended in some 
cases.?? Specific data, such as individual responses and subtest scores, moet 
narily be cited only to illustrate or clarify a point: Reports should be carefully = 
ganized and integrated. They should be written simply so as to communicate an 
not to obfuscate. Books on the preparation of assessment reports usually de. 
helpful hints for good writing, as well as references to standard an cu 
One particularly entertaining little book that should make writing F3 ies or 
both writer and teader is The Elements of Style by Strunk and White ( a 


The Role of Computers in Psychological Assessment 


i i i ^ lity inven- 
Computer scoring of many kinds of tests, including self-report persona 
tories, has been available for several decades. Most of the computer programs also 


Sattler (1988, chap. 23) gives a comprehensive view SE pnd pie Ss d ubt 
and summarizes pertinent research on report writing. Other Ws 

Ownby (1991) and Tallent (1993). ; " 

50See the ger 1996 American Psychologist for a “Statement on ; i aen opto er ES 2 
Committee on Psychological Tests and Assessment of the American Psychologic 
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carry out routine statistical analyses, providing such further information as vari- 
ous types of derived scores, score intervals in terms of the SEM, and score profiles, 
Narrative consultative reports constitute a more sophisticated application of 
computer technology to the reporting and utilization of test data (Butcher, 1987; 
Moreland, 1992). Essentially, these programs use a large database of qualitative 
interpretive statements that are attached to particular levels or patterns of quanti- 
tative scores. Apart from saving the clinician’s time, this procedure has other ad- 
vantages. The computer is able to search, systematically and consistently, through 
a far more extensive database than would be feasible for an individual clinician; it 
can also reliably apply more complex rules to the actuarial data in selecting ap- 
propriate statements; and it can insert into the process many other related vari- 
ables, such as demographic data from different normative populations.”! 

Unquestionably, the potential contribution of computers to psychological assess- 
ment is impressive (Butcher, 1987; Eyde, 1987; Gutkin & Wise, 1991; Moreland, 
1992). Much of this potential, however, is only beginning to be explored (Embret- 
son, 1992). For example, the application, of branching techniques and adaptive 
testing, whose advantages are now widely recognized in ability testing, is just begin- 
ning to find its way into personality testing (Ben-Porath & Butcher, 1986; Jackson, 
1985, 1991) and has not yet produced any instruments suitable for clinical use. 

On the other hand, computer technologies have led to the development and 
rapid proliferation of many new instruments for the assessment of cognitive func- 
tioning which are already being used in clinical neuropsychology, as well.as in the 
assessment of learning disabilities and attention disorders (see, e.g., Krug, 1988, 
1993; Stoloff & Couch, 1992). For the foreseeable future, it is unlikely that ade- 
quate assessment of neurocognitive functioning will be accomplished solely by 
means of computerized test administration (Golden, 1987). Nevertheless, com- 
puters do make it possible to vary task presentation conditions more precisely so 
as to assess performance on different task components. They also allow the 
recording and evaluation of response parameters—such as timing—in ways that 
are not possible with paper-and-pencil or even individual tests. The MicroCog: 
Assessment of Cognitive Functioning (MicroCog—Powell et al., 1993; Powell 
& Whitla, 1994a, 1994b) is an example of a recently developed. computer- 
administered battery designed to screen for possible signs of cognitive impairment 
in adults. MicroCog consists of 18 subtests—in the areas of attention and mental 
control, memory, reasoning and calculation, spatial processing, and reaction 
time—that make use of the unique capabilities of computer technology. Potential 
uses of this relatively quick and inexpensive battery include the assessment of mild 
to moderate levels of cognitive deterioration in older adults and monitoring cogni- 
tive performance among employees who might be exposed to noxious stimuli, as 
well as other instances requiring the precise evaluation of neurocognitive changes. 
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ey publishers and distributors of software for test profile reports, narrative reports, and other computer- 
assisted test interpretation options provide sample reports and demonstration disks to potential users. In addi- 
tion, the Psychware Sourcebook (Krug, 1993), which is one of the most complete sources of information currently 
available on computer-based assessment products, contains reproductions of many sample reports 
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Many other new instruments for the assessment of specific functions have also 
been developed and will undoubtedly continue to appear in years to come. 
Among the earliest measures of this nature are several continuous performance 
tests, such as the Test of Variables of Attention (TOVA—Leark, Dupuy, Green- 
berg, Corman, & Kindschi, 1996), which is available in an auditory as well as a 
visual version. Additional examples include the Vigil Continuous Performance 
Test (Cegalis, Cegalis, & Bowlin, 1993) and the Paced Auditory Serial Addition 
Test (PASAT—Cegalis & Birdsall, 1995), both of which involve multimedia soft- 
ware packages that allow for the assessment of attention in the visual and audi- 
tory modalities. Figure 17-5 shows a child taking the visual version of the TOVA. 

From another angle, computers also have the potential to integrate data from 
many sources, including all kinds of available tests, case history, and direct behav- 
ioral observations (see, e.g. Watkins & McDermott, 1991). Although the com- 
puter could thereby assume the clinician’s synthesizing function for the individual 
case, the computerized database necessary for developing and maintaining such 
integrative programs is still not available. 


17-5. AChild Taking the Test of Variables of Attention 
: : ternal Microswitch Mechanism 


Attached to One of the Computer's Ports. 


Courtesy of American Guidance Service.) 


(Coney of American Gunes Semis) 0 


Figure 
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Among current applications, there are also serious pitfalls to be considered 
(Moreland, 1992). Most available systems of computer-assisted test interpretation 
combine clinical and statistical procedures. The specific mix of quantitative data 
and clinical judgment varies among the systems, as do also the technical quality 
of the database and the clinical expertise of the judgments. Moreover, the infor- 
mation needed to evaluate a particular system is often unavailable because of pro- 

rietary concerns. 

` Teis largely because of the absence of the necessary technical information that 
the potential misuses of computerized test interpretation programs have aroused 
widespread concern (Eyde & Kowal, 1987; Fowler & Butcher, 1986; Matarazzo, 
1986a, 1986b; Moreland, 1987). Some programs already meet adequate scientific 
and professional standards or are being revised to meet that goal; several are of 
unknown quality and have never been reviewed by qualified professionals; far too 
many are being oversold through unverified claims in their promotional litera- 
ture. The first guidelines for the evaluation and use of computerized interpretive 
services were promulgated by the American Psychological Association in 1986; 
additions and revisions to these guidelines are being incorporated into the new 
Testing Standards, currently under preparation. Some additional guidance on the 
use of computerized psychological testing tools is available in a few other publica- 
tions (see, e.g., Bersoff & Hofer, 1991; Moreland, 1992). 


Concluding Comments. Altogether, the field of psychological assessment as 
practiced by various specialists is undergoing change as rapidly as the other areas 
considered in this chapter. In addition to the multitude of computer-based 
developments that are continually taking place, and to the other trends that have 
already been singled out throughout this last section of the chapter, there is a 
renewed emphasis on the need for assessment tools that are oriented toward 
positive mental health rather than psychopathology. One recent example is the 
Quality of Life Inventory (QOLI—Frisch, 1994), which is a measure of life 
satisfaction that can be used in treatment planning as well as in the evaluation of 
treatment outcomes. Several other measures of this type are currently in various 
stages of development. Another interesting new, and unique, kind of instrument 
is the Cross-Cultural Adaptability Inventory (OCAI—C. Kelley & Meyers, 
1993), which as its name implies is a self-assessment tool intended to help 
individuals decide on their readiness to adapt to other cultures. The Student 
Adaptation to College Questionnaire (SACQ—R. W. Baker & Siryk, 1989) is 
yet another tool which, like the CCAI, typifies the application of psychological 
testing to individual self-understanding and self-enhancement, an application 
that is a direct outgrowth of the influence of counseling psychology and that is 
likely to expand greatly in the future. 


Ethical and Social 
Considerations in Testing 


n both their research and the practical applications of their procedures, 
» psychologists have long been concerned with questions of professsional 
39 ethics. A concrete example of this concern is the systematic empirical pro- 

gram followed in the early 1950s to develop the first formal code of ethics for 
the profession. This extensive undertaking resulted in the preparation of a set of 
standards that was officially adopted by the American Psychological Association 
(APA) and first published in 1953. These standards undergo continual review and 
refinement, leading to the periodic publication of revised editions. The current 
version, Ethical Principles of Psychologists and Code of Conduct (APA, 1992),! com- 
prises a preamble and six general principles designed to guide psychologists to- 
ward the highest ideals of the profession. It also provides eight ethical standards 
with enforceable rules for psychologists functioning within diverse contexts. 

The Ethics Code is implemented by the APA Ethics Committee, which investi- 
gates and adjudicates complaints against members of the association. The rules 
and procedures for the operation of this committee as well as its yearly reports are 
published in the American Psychologist, the official journal of the American Psy- | 
chological Association (see, e.g; APA, Ethics Committee, 1995, 1996). An in- 
dispensable guidebook offering commentaty and illustrations on the application 
of the ethical standards has been prepared by former Ethics Committee members 


Vo an effort to disseminate the Ethical Principles of Psychologists and Code of Conduct—hereinafter referred to as 
the Ethics Code—as widely as possible, the APA will send a single copy of this publication, free of charge, to any- 


one who requests it. 
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who were involved in the latest revision of the Ethics Code (Canter, Bennett, 
Jones, & Nagy, 1994). Another useful collection of historical and contemporary 
readings on ethics in psychology, including discussions of ethical dilemmas in a 
variety of settings, has been recently compiled by Bersoff (1995). Pope and 
Vasquezs (1991) volume on Ethics in Psychotherapy and Counseling includes a 
chapter on assessment issues which provides some good practical advice. Weiner 
(1995) has recently written a helpful chapter on how to anticipate ethical and 
legal challenges in personality assessment. 

The 1990s have witnessed a rapid proliferation of legislative actions at the fed- 
eral and state levels, court decisions, and professional guidelines from a diversity 
of viewpoints, many of which impinge on the practice of psychology in general 
and on the use of psychological tests in particular. Several of these developments 
have already been discussed in earlier chapters in the context of specific types of 
testing issues and practices. All too often, the combined effect of these injunc- 
tions has involved confusion, inconsistencies, and conflicts for the practitioner. 

Increasingly, attention has focused on providers of psychological services who 
need to balance the ethical principles of their profession with legal and regulatory 
mandates, as well as with the institutional policies of the organizations where 
they work. Accordingly, the American Psychological Association, through its 
various offices and committees, has striven to provide guidance and information 
to members by monitoring pertinent developments and by promulgating stan- 
dards, guidelines, and statements on issues likely to present problems for practic- 
ing psychologists. The General Guidelines for Providers of Psychological Services 
(APA, 1987a) and the Specialty Guidelines for the Delivery of Services (APA, 
1981—currently under revision) were promulgated to aid those involved in the 
professional practice of psychology within several contexts. Additional guidance 
on specific issues is provided in other documents, such as the Guidelines for Child 
Custody Evaluations in Divorce Proceedings, drafted by the Committee on Profes- 
sional Practice and Standards (APA, COPPS, 1994), and several others men- 
tioned throughout this chapter. In addition, since the late 1980s, the APA has 
been preparing a series of volumes that summarize and analyze the laws affecting 
mental health professionals in each state; as of this writing, about a dozen of these 
volumes have been published and, in some cases, have been updated already (see, 
a & Pope, 1995; Petrila & Otto, 1995; Shuman, 1990, 1993; Wulach, 

The APA Committee on Psychological Tests and Assessment (CPTA) is spe- 
cifically devoted to considering problems regarding sound testing and assessment 
practices, and to providing technical advice concerning those practices to other 
APA groups. It has produced several documents, which will be mentioned later in 
this chapter, with guidance on test-related problems and practices. The CPTA 
also reviews the work of the Joint Committee on Testing Practices (JCTP), a 
group established by the APA and other professional organizations concerned 
with testing. In turn, the Joint Committee has developed a Code of Fair Testing 
Practices in Education (JCTP, 1988) and other products aimed at improving the 
manner in which tests are applied and at preventing their misuse (see, e.g., Eyde 
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et al., 1988, 1993). Currently, the JCTP is preparing a statement on the “Rights 
and Responsibilities of Test Takers.” 

The view from the test taker's perspective and certain aspects of the role of test 
users were considered in chapter 1. In this chapter, we turn our attention to ethi- 
cal and social issues that affect the use of tests across various contexts. In addition 
to matters concerning professional competence, we briefly discuss the responsibil- 
ities of test publishers, the test taker's right to privacy, the issue of confidentiality, 
and the testing of individuals of diverse backgrounds and levels of capability. Al- 
though we will deal with the impact of legislation to a certain extent, a detailed 
consideration of the multiplicity of legal aspects impinging on testing practice is 
well beyond our scope. For this, the interested reader is referred to the various 
sonra nente in this and other chapters of this text (see, especially, chaps. 9 
an , 


ETHICAL ISSUES IN PSYCHOLOGICAL TESTING 
AND ASSESSMENT 


Since the 1970s there has been a heightened concern not only with ethical prob- 
lems but also with. broader questions of values in all fields of both theoretical an: 
applied psychology (Bersoff, 1995; Diener & Crandall, 1978; Jacob & 
Hartshorne, 1991; Pope & Vasquez, 1991). In the testing area, thoughtful and 
provocative analyses of the role of values and of the underlying ethical rationale 
of various practices have been presented by Eyde and Quaintance (1988) and by 
Messick (1980b, 1989, 1995), At a more specific level, the APA Ethics Code con- 
tains much that is applicable to psychological testing. One of the standards— 
Evaluation, Assessment, or Intervention—is directly concerned with the devel- 
opment and use of psychological assessment techniques. Another one—Forensic 
Activities—contains a section devoted specifically to assessments in legal con- 
texts. In addition, the ethical standard on Privacy and Confidentiality, although 
broader in scope, is also highly relevant to testing, as are most of the other general 
principles and several of the ethical standards (APA, 1992). Some of the matters 
discussed in. the Ethics Code are closely related to points covered in the Testing 
Standards cited in. chapter 1. In fact, the content of the Testing Standards them- 
selves helps to define the professionally responsible use of tests. 

Besides the APA, other related professional groups and associations have de- 
veloped their own ethical codes and guidelines. Among the most pertinent of 
these, from the point of view of testing, is the statement entitl 
Users of Standardized Tests (the “RUST Statement”), adopted in 1989 by the 
American Coui 
ciples for the Validation and Use of Personnel Selection Procedures developed by the 
Society for Industrial and Organizational Psychology (SIOP—1987) for a more 
specialized purpose (see chap- 17). 

A significant event in clarifying the place of testing in modern society was the 


publication of Ability Testing: Uses, Consequences, and Controversies (Wigdor & 
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Garner, 1982). This two-volume book is the final report of a four-year project 
that examined the use of standardized ability tests in schools, in admission to 
higher education, and in employment testing. Begun at a time of widespread pub- 
lic debate about the value of testing, the project was directed by a multidiscipli- 
nary committee under the auspices of the National Research Council. Since the 
early 1980s, other important studies and reports concerning problem areas of 
testing have been published (see, e.g., Hartigan & Wigdor, 1989; Office of Tech- 
nology Assessment, 1992). In general, the findings of these various groups have 
substantiated and lent additional credence to well-established and often reiter- 
ated conclusions about both the contributions and potential misuses of ability 
tests. 

Government's increasing and pervasive involvement with the application of 
psychological tests and other assessment tools led to the creation of the Board on 
Testing and Assessment (BoTA), which was established in 1993 with the support 
of the U.S. departments of Defense, Education, and Labor (see Appendix B). The 
BoTA is an ongoing activity of the National Research Council. Its main objec- 
tives are to help policymakers understand and evaluate tests and other appraisal 
instruments used as tools of public policy. The Board concentrates on cross- 
cutting issues that affect testing and assessment in multiple contexts and has pub- 
lished reports on such topics as the Goals 2000 legislative initiative in education 
(Feuer & Kober, 1995) and the improvement plan for the General Aptitude Test 
Battery (BoTA, 1995), as well as the assessment and educational implications of 
Fee are taking place in the nature of work (Black, Feuer, Guidroz, & Les- 
gold, ] 


USER QUALIFICATIONS AND PROFESSIONAL COMPETENCE 


The Ethics Code principle on competence states that psychologists “provide only 
those services and use only those techniques for which they are qualified by edu- 
cation, training, or experience” (APA, 1992, p. 1599). With regard to tests, the 
requirement that they be used only by appropriately qualified examiners is one 
step toward protecting the test taker against the improper use of tests.2 Of course, 
the necessary qualifications vary with the type of test. Thus, a relatively long pe- 
tiod of intensive training and supervised experience is required for the proper use 


tional purposes are not usually equipped to administer the tests to others or to in- 
terpret the scores properly, 


es 
For a discussion ofthe role of the test wer; and of the Test User Qualifications and Test U, i 

d Traini: 
conducted by work groups of the JCTP (Eyde et al., 1988, 1993; Moreland et al., 1995), see Se 
Canadian Psychological Association and the British Psychological Society have also taken steps toward devel- 
oping systems for establishing test user qualifications (D. C. Brown, 1995), 


ue 
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Well-trained examiners choose tests that are appropriate for both the particu- 
lar purpose for which they are testing and the persons to be examined. They are 
also cognizant of the available research literature on the chosen test and able to 
evaluate its technical merits with regard to such characteristics as norms, reliabil- 
ity, and validity. In administering the test, they are sensitive to the many condi- 
tions that may affect test performance, such as those illustrated in chapter 1. They 
draw conclusions or make recommendations only after considering the test score 
(or scores) in the light of other pertinent information about the individual. 
Above all, they should be sufficiently knowledgeable about the science of human 
behavior to guard against unwarranted inferences in their interpretations of test 
scores. When tests are administered by psychological technicians or assistants, Or 
by persons who lack professional training in psychometric principles and proper 
assessment practices, it is essential that an adequately qualified psychologist be 
available, at least as a consultant, to provide the needed perspective for a proper 
interpretation of test performance. 

Who is a qualified psychologist? Obviously, with the diversification of the disci- 
pline and the consequent specialization of training, no psychologist is equally qual- 
ified in all areas, even within the narrower field of psychological testing and 
assessment (see chap. 17). In recognition of this fact, the Ethics Code calls for psy- 
chologists to “recognize the boundaries of their particular competencies and the 
limitations of their expertise” (APA, 1992, p. 1599). The implications of this ethi- 
cal obligation are spelled out in the competence principle, which was cited earlier. 

A significant step, both in upgrading professional standards and in helping the 
public to identify qualified psychologists, was the enactment of state licensing an 
certification laws for psychologists. All states, as well as the District of Columbia, 
now have such laws; most Canadian provinces have also enacted laws regulating 
the practice of psychology (for a summary of all these laws, see APA, 1993, pp. 
xlii-xlv). Although the terms “licensing” and “certification” are often used inter- 
changeably, in psychology, certification typically refers to legal protection of the 
title “psychologist,” whereas licensing controls the practice of psychology regard- 
less of the title by which the practitioner is identified. Licensing laws thus need to 
include a definition of the practice of psychology. Most states began with the sim- 
pler certification laws, but there has been continuing movement toward licensing. 
In either type of law, the requirements are generally a doctorate in psychology, @ 
specified: amount of supervised experience, and satisfactory performance on a 
qualifying examination. Licensing statutes typically include grounds for discipli- 
nary actions against psychologists, which can range from fines and reprimands to 
the suspension and revocation of the license. Many jurisdictions have incorpo- 
rated the ethical requirements of the APA into their statutes, either directly or 
indirectly, so that violations of the Ethics Code often constitute grounds for some 
form of disciplinary action. The APA also has developed a Model Act for State Li- 
censure of Psychologists (APA, 1987b) to serve as a prototype for the drafting of 
state legislation regulating the practice of psychology: 

At a more advanced level, specialty certification is provided by the American 
Board of Professional Psychology (ABPP—see Appendix B). Requiring à high 
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may also be obtained from ABPP. As a privately constituted federation of boards 
within the profession, ABPP does not have the enforcement authority available 
to the agencies administering the state licensing and certification laws. 

In the past decade, shifts in the health care delivery system and other changes 
in the professional marketplace have lent increasing urgency to the matter of cre- 
dentials for the practice of psychology. Therefore, the American Psychological 
Association has taken several steps to promote the orderly management of the 


user qualifications, as well as new procedures for credentialing assessment special- 
ists, will undoubtedly be forthcoming in the future. 


RESPONSIBILITIES OF TEST PUBLISHERS 


The purchase of tests is generally restricted to persons who meet certain minimal 
qualifications. The catalogs of major test publishers specify requirements that must 


test distributors may make sincere efforts to implement these objectives, the con- 
trol they are able to exert is necessarily limited. In some cases, it may not be feasi- 
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ble to investigate and verify the alleged qualifications of test purchasers (see, e.g., 
Oles & Davis, 1977). Moreover, the formal qualifications provide only a rough 
screening device. It is evident, for example, that an M.A. degree in psychology— 
or even a Ph.D., a state license, and an ABPP diploma—do not necessarily signify 
that the individual is qualified to use a particular test or that her or his training is 
relevant to the proper interpretation of the results obtained with that test. The 
major responsibility for the proper use of tests ultimately resides in the individual 
user or institution concerned. 

Another professional responsibility pertains to the marketing of psychological 
tests by authors and publishers. Tests should not be released prematurely for gen- 
eral use. Nor should any claims be made regarding the merits of a test in the ab- 
sence of sufficient objective evidence. When a test is distributed early for research 
purposes only, this condition should be clearly specified and the distribution of 
the test restricted accordingly. The test manual should provide adequate data to 
permit an evaluation of the test itself as well as full information regarding admin- 
istration, scoring, and norms. The manual should be a factual exposition of what 
is known about the test rather than a selling device designed to put the test in a 
favorable light. It is the responsibility of the test author and publisher to revise 
tests and norms often enough to prevent obsolescence. The rapidity with which a 
test becomes outdated, of course, varies widely with the nature of the test. 

Tests that need to be secure because of their use in selection, placement, or diag- 
nostic decisions should not be published in the popular media, in part or as a whole, 
for obvious reasons. Any publicity given to specific test items will tend to invalidate 
the future use of the test with other persons. In addition, publication of tests in the 
popular media may lead to psychologically injurious self-evaluation by members of 
the general public. Another practice that is nearly always unprofessional is testing by 
mail. Not only does this procedure provide no control of testing conditions, but usu- 
ally it also involves the interpretation of test scores in the absence of other pertinent 
information about the individual. With a few possible exceptions, such as the use of 
interest or values inventories with fairly sophisticated and well-motivated individu- 
als, the results of tests taken under these conditions may be worse than useless. 

It is noteworthy that since the 1980s, test publishers have begun to take steps 
toward ensuring that the tests they publish and distribute are properly used and 
the scores correctly interpreted. To this end they have made efforts to expand and 
improve communication with their clients about specific tests and to enhance 
public understanding of testing in general. Test publishers participated with the 
APA and other national organizations in the Test User Qualifications and Train- 
ing projects of the Joint Committee on Testing Practices (Eyde et al., 1988, 
1993). In addition, they have established the Association of Test Publishers 
(ATP—see Appendix B), an organization whose members are pledged to promote 
the integrity of assessment services and produéts and to enhance their value to 


*A policy statement regarding “take home" tests was recently prepared by rhe APA Ethics Committee in re- 
sponse to an inquiry concerning the propriety of sending the MMPI home for administration (APA, Ethics 


Committee, 1994, pp. 665-666). 
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society. The ATP has recently published the second edition of a set of Model 
Guidelines for Preemployment Integrity Testing (ATP, 1996). 


PROTECTION OF PRIVACY 


A question arising particularly in connection with personality tests is that of 
invasion of privacy. In a report entitled Privacy and Behavioral Research (1967), the 
right to privacy is defined as the right to decide for oneself how much one 
will share with others one’s thoughts, feelings, and facts about one’s personal life; 
this right is further characterized as “essential to insure freedom and self- 
determination” (p. 2). Insofar as some tests of emotional, motivational, or attitudi- 
nal traits are necessarily disguised, the examinee may reveal characteristics in the 
course of such a test without realizing that he or she is so doing. For purposes of 
testing effectiveness, it may be necessary to keep the examinee in ignorance of the 
specific ways in which the responses to any one test are to be interpreted. Never- 
theless, a person should not be subjected to any testing program under false pre- 
tenses. Of primary importance in this connection is the obligation to have a clear 
understanding with the examinee regarding the use that will be made of test results. 

Although concerns about the invasion of privacy have been expressed most 
commonly about personality tests, they logically apply to any type of test. Cer- 
tainly any intelligence, aptitude, or achievement test may reveal limitations in 
skills and knowledge that an individual would rather not disclose. Moreover, any 
observation of an individual's behavior—as in an interview, a casual conversa- 
tion, or other personal encounter—may yield information that the individual 
would prefer to conceal and that he or she may reveal unwittingly. The fact that 
psychological tests have often been singled out in discussions of invasions of pri- 
vacy probably reflects prevalent misconceptions about tests as well as their fre- 
quent misuse as a sole basis for decisions about individuals. If all tests were 
recognized as measures of behavior samples, with no mysterious powers to pene- 
trate beyond behavior, popular fears and suspicion would be lessened. Similarly, if 
tests were interpreted only within the context of comprehensive assessments, 
whenever decisions of importance to the individual are at stake, the undue 
amount of weight often placed upon any given test result would diminish. 

It should also be noted that all behavioral research, whether employing tests or 
other observational procedures, presents the possibility of invasion of privacy. Yet, 
as scientists, psychologists are committed to the goal of advancing knowledge 
about human behavior. Conflicts of values may thus arise, and they must be re- 
solved in individual cases.’ The problem is obviously not simple; and it has been 
the subject of extensive deliberation.Ó To safeguard personal privacy, no universal 


5The Ethical Principles in the Conduct of Research with Human Participants (APA, 1982) provide some guidance in 
this regard. 

SGee, for example, E. Allan Hanson’s (1993) critique of testing and its role in modern society. Although this trea- 
tise is clearly grounded in an ideology that is antithetical to testing and far from dispassionate, it may be of in- 
terest to readers from an anthropological perspective. 
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rules can be formulated; only general guidelines can be provided. Solutions must 
be worked out in terms of the particular circumstances of a case and in the 
light of the ethical awareness and professional responsibility of the individual 
psychologist. 

One relevant factor is the purpose for which the testing is conducted— 
whether for individual counseling, institutional decisions regarding selection and 
classification, or research. For example, in clinical or counseling situations, 
clients are usually willing to reveal themselves in order to obtain help with their 
problems. Whatever the purposes of testing, the protection of privacy involves 
two key concepts: relevance and informed consent. The information that the in- 
dividual is asked to reveal must be relevant to the stated purposes of the testing. 
An important implication of this principle is that all practicable efforts should be 
made to ascertain the validity of tests for the particular diagnostic or predictive 
purpose for which they are used. Recent legal developments, such as the Soroka v. 
Dayton Hudson case (see, e.g., Merenda, 1995) and the Americans with Disabili- 
ties Act of 1990 (P.L. 101-336), have underscored the importance of keeping the 
invasiveness of inquiries in preemployment testing to the minimum necessary and 
of ensuring the demonstrable relevance of such inquiries to job performance (see, 
e. g., Bruyère & O'Keeffe, 1994; D. C. Brown, 1996; Herman, 1994, chap. 2). In 
the Soroka case, job applicants challenged the use of a screening test on the 
grounds that its questions about religious beliefs and sexual preference—taken 
from the MMPI and the CPI—were invasive and discriminatory. Although the 
case was settled without a definitive ruling, several test developers—including 
the authors of the latest MMPI and CPI revisions—have already removed such 
items from self-report inventories (see chaps. 13 and 17). 

The concept of informed consent also requires clarification; and its application 
in individual cases calls for the exercise of considerable judgment (AERA, APA, 
NCME, 1985). Although the current Ethics Code contains an explicit standard 
requiring informed consent only for therapy, and not for assessment, such a re- 
quirement is implicit in other standards concerning evaluation and diagnosis in 
professional contexts, as well as in several other parts of the code. Furthermore, 
state psychology board regulations, case law, institutional rules, and/or prevailing 
standards of practice typically require informed consent in the context of both as- 
sessment and intervention activities (Canter et al., 1994, p. 67). The test taker 
should certainly be informed about the purpose of testing, the kinds of data 
sought, and the use that will be made of the scores. It is not necessary, however, 
that he or she be shown the test items in advance or told how specific responses 
will be scored. Nor should the test items be shown to a parent, in the case of a 


minor.” Such information would usually invalidate the test. These and other spe- 
d related issues in test- 


cial questions that may arise regarding informed consent an 
ing and assessment situations are considered in the Testing Standards’ chapter on 


the rights of test takers. 


"For guidance on consent for assessment and other ethical and legal issues in the psychological evaluation of mi- 
nors, see Kamphaus and Frick (1996, chap. 4). 
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CONFIDENTIALITY 


Like the protection of privacy, to which it is related, the problem of confidential- 
ity of test data is multifaceted. The fundamental question is: Who shall. have ac- 
cess to test results? Several considerations influence the answer in particular 
situations. Among them are the security of test content, the hazards of misunder- 
standing test scores, and the need for various persons to know the results. 

There has been a growing awareness of the right of individuals to have access 
to the findings in their own test reports. The test taker should also have the op- 
portunity to comment on the contents of the report and, if necessary, to clarify or 
correct factual information. Counselors are now trying more and more to involve 
clients as active participants in their own assessment. For these purposes, test re- 
sults should be presented in a form that is readily understandable, free from tech- 
nical jargon or labels, and oriented toward the immediate objective of the testing. 
Proper safeguards must be observed against misuse and misinterpretation of test 
findings. 

Discussions of the confidentiality of test records have usually dealt with acces- 
sibility to a third person, other than the individual tested (or parent of a minor) 
and the examiner. The underlying principle is that such records should not be re- 
leased without the knowledge and consent of the test taker, unless such a release 
is mandated by law or permitted by law for valid purposes. When tests are admin- 
istered in an. institutional context, as in a school system, court, or employment 
setting, the individual should be informed at the time of testing regarding the 
purpose of the test, how the results will be used, and their availability to institu- 
tional personnel who have a legitimate need for them. When test results are re- 
quested by outsiders, as when a prospective employer or a college requests test 
results from a school system, a separate consent for release of data is required. The 
same requirements apply to tests administered in clinical and counseling con- 
texts, or for research purposes. Further guidance on this issue is available in the 
Statement on the Disclosure of Test Data (APA, 1996) formulated by the CPTA to 
help psychologists deal with questions concerning the release of test records. In 
addition, the APA's Committee on Legal Issues (COLI) has devised some strate- 
gies for psychologists who must cope with subpoenas or compelled testimony con- 
cerning client records or test data in the course of their practices (APA, COLI, 
1996). 

Another problem pertains to the retention of records in institutions. On the 
one hand, longitudinal records on individuals can be very valuable, not only for 
research purposes but also for understanding and counseling the person. As is so 
often the case, these advantages presuppose proper use and interpretation of test 
results. On the other hand, the availability of old records opens the way for such 
misuses as incorrect inferences from obsolete data and unauthorized access for 
other than the original testing purpose. It would be manifestly absurd, for exam- 
ple, to cite an IQ or a reading achievement score obtained by a child in the third 
grade when evaluating her or him for admission to college. Similarly, when 
records are retained for many years, there is danger that they may be used for pur- 
poses that the test taker (or the test taker's parents) never anticipated and would 


Ethical and Social Considerations in Testing 551 


not have approved. To prevent such misuses, when records are retained either for 
legitimate longitudinal use in the interest of the individual or for acceptable re- 
search purposes, access to them should be subject to unusually stringent controls. 
Every type of institution should formulate explicit policies regarding the destruc- 
tion, retention, and accessibility of personal records. The ‘statement on Record 
Keeping Guidelines (APA, COPPS, 1993) contains more information on this 
topic. 


COMMUNICATING TEST RESULTS 


In recent years, psychologists have begun to give more thought to the communi- 
cation of test results in a form that will be meaningful and useful to the recipient. 
Certainly, the information should not be transmitted’ routinely, but should pro- 
vide appropriate interpretive explanations. Broad levels of performance and qual- 
itative descriptions in: simple terms are to be preferred over specific numerical 
scores, except when communicating with adequately trained professionals. Even 
well-educated laypersons have been known to confuse percentiles with percent- 
age scores, percentiles with IQs, norms with standards, and interest ratings with 
aptitude scores. But a more serious misinterpretation pertains to the conclusions 
drawn from test scores, even when their technical meaning is correctly under- 
stood. A familiar example is the popular assumption that an IQ indicates a fixed 
characteristic of the individual that predetermines her or his lifetime level of in- 
tellectual achievement. 

Among the possible recipients of test results, besides the test takers themselves, 
are parents of minors, teachers and other school personnel, employers, psychia- 
trists, and court and correctional personnel. In all test-related communication, it 
is desirable to take into account the characteristics of the person who is to receive 
the information. This applies not only to that person's general education and her 
or his knowledge about psychology and testing, but also to her or his anticipated 
emotional response to the information. In the case of a parent or a teacher, for ex- 
ample, personal emotional involvement with the child may interfere with a calm 
and rational acceptance of factual information. 

Similar problems are encountered when communicating test results to individ- 
ual test takers themselves, whether children or adults? The same general safe- 
guards against misinterpretation apply here as in communicating with a third 
party. In this regard, the Testing Standards stress the need for those who use tests in 
clinical and counseling applications to provide test takers with appropriate and 
understandable explanations of test results and of any recommendations stem- 
ming from them. 

The consideration of emotional reactions to test information is especially im- 
portant when persons are learning about their own assets and shortcomings. 


lk esu ede qp 

8The September 1992 issue of Psychological Assessment contains a special section on the topic of providing psy- 
chological test feedback to clients. Ina particularly helpful article, K- S. Pope discusses ten fundamental aspects 
of feedback, which he says “may be the most neglected aspect of assessment” (1992, p. 265). 
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When an individual is given her or his test results, not only should the data be in- 
terpreted by a properly qualified person, but facilities should also be available for 
counseling anyone who may become emotionally disturbed by such information. 
For example, a college student might become seriously discouraged when he 
learns of his poor performance on a scholastic aptitude test. A gifted schoolchild 
might develop habits of laziness and shiftlessness, or she might become uncooper- 
ative and unmanageable, if she discovers that she is much brighter than any of her 
peers. Such detrimental effects may, of course, occur regardless of the correctness 
or incorrectness of the score itself. Even when a test has been accurately adminis- 
tered and scored and properly interpreted, a knowledge of such a score without 
the opportunity to discuss it further may be harmful to the individual. 

Counseling psychologists have been especially concerned with the develop- 
ment of effective ways of transmitting test information to their clients (see, e.g., 
Hood & Johnson, 1997, chap. 17). Although the details of this process are be- 
yond the scope of our present discussion, two major guidelines are of particular in- 
terest. First, test reporting is to be viewed as an integral part of the counseling 
process and incorporated into the total counselor—client relationship. Second, in- 
sofar as possible, counselors need to engage their clients in interpreting test results 
in light of the specific questions raised by them. An important consideration in 
counseling relates to the client's acceptance of the information presented to her 
or him. The counseling situation is such that if the individual rejects any infor- 
mation, for whatever reasons, then that information is likely to be totally wasted. 
On the other hand, the acceptance of properly interpreted test findings can be 
of therapeutic value to a client—in and of itself—especially in the context of 
cognitively-oriented treatment. 


TESTING DIVERSE POPULATIONS 


The Setting. The decades since 1950 have witnessed an increasing public 
concern with the rights of ethnic minorities, women, and individuals with 
disabilities, as well as other minority groups.? This concern is reflected in the 
enactment of civil rights legislation at both federal and state levels. In 
connection with mechanisms for improving educational and vocational 
opportunities for individuals from these diverse groups, psychological testing has 
been a major focus of attention (Gifford, 1989a, 1989b). The psychological 
literature contains many discussions of the topic, whose impact ranges from 
clarification to obfuscation. Among the most clarifying contributions are several 
position papers and guidelines by professional associations (see, e.g., ACA, 1989; 
APA, Board of Ethnic Minority Affairs, 1990; APA, Division of Evaluation, 


?Although women represent a statistical majority in the U.S. population, legally, occupationally, and in other 
ways they have shared many of the problems of minorities. Hence when the term “minority” is used in this re- 
gard, it is understood to include women. 


——— 


Ethical and Social Considerations in Testing 559 


Measurement, and Statistics, 1993: Prediger, 1993; Sackett & Wilk, 1994). In 
addition, guidance on proper assessment practices with diverse populations is 
becoming increasingly available (see chap. 9; Dana, 1996a; Sattler, 1988, chaps. 
19 & 20; Suzuki et al., 1996; Valencia & Lopez, 1992). Reports prepared under 
the auspices of the National Research Council, the Office of Technology 
Assessment, and other such groups—cited in an earlier section of this chapter— 
have examined the controversies about tests in the light of the current social 
context and presented balanced views of the functions of testing. 

Much of the concern centers on the lowering of test scores by cultural condi- 
tions that may have affected the development of aptitudes, interests, motivation, 
attitudes, and other psychological characteristics of minority group members. 
Some of the proposed solutions to the problem reflect misunderstandings about 
the nature and function of psychological tests. Differences in the experiential 
backgrounds of groups or individuals are inevitably manifested in test perfor- 
mance. Every psychological test measures a behavior sample. Insofar as culture af- 
fects behavior, its influence will and should be detected by tests. If we could rule’ 
out all cultural differentials from a test, we might thereby lower its validity as a 
measure of the behavior domain it was designed to assess. In that case, the test 
would fail to provide the kind of information needed to correct the very condi- 
tions that impaired performance. 

The underlying theoretical rationale and testing procedures for different kinds 
of special populations were discussed more fully in chapters 9 and 12. A technical 
analysis of the concept of “test bias” was given in chapter 6, in connection with 
test validity. In this chapter, interest Centers primarily on the professional issues 
and social implications of minority group testing. 


Legal Regulations. Since 1960, there have been rapid developments pertain- 
ing to the educational and employment testing of minorities. These develop- 


ions, executive orders, and court decisions. Legisla- 


ments include legislative act ; P 9 and 
tion pertaining to educational testing was cited and reviewed in chapters ? an 


17; a review of current trends and issues in externally mandated testing is 
presented by Linn and Gronlund (1995, chap- ar , 
In the realm of employment, increasingly, the courts have played an important 
part in interpreting and applying civil rights laws. The implications of several ‘ae 
mous court cases have been widely discussed in the personnel and testing litera- 
ture by persons trained in psychology, law, or both (see, Gë: APA, CPTA, 1988; 
Bersoff, 1983, 1984; Bruyère & O'Keeffe, 1994; Hollander, 1982; Merenda, eat 
Meyers, 1992; Wigdor, 1982). The most pertinent federal legislation is provide 
by Tide VII of the Civil Rights Act of 1964 (PL. 88-352—also known as s 
Equal Employment Opportunity Act) with its subsequent amendments, the Civi 


Rights Act of 1991 (PL. 102-166), and the Americans with Disabilities Act of 
1990 (PL. 101-336). Responsibility for implementation and enforcement of these 


cio be maiè TASS 
VoDiscussions of some significant court decisions in the area of psychoeducational assessment can, be found in 


Ayers, Day, and Rotatori (1990) and Reschly (1988). 
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is vested. principally in the Equal Employment Opportunity Commission 
(EEOC), which develops and distributes guidelines for this purpose. In 1978, in 
the interests of simplified procedure and improved coordination, the Uniform 
Guidelines on Employee Selection Procedures were jointly adopted by the EEOC, the 
Civil Service Commission (now the U.S. Office of Personnel Management), and 
the departments of Justice, Labor, and Treasury.! 

The Equal Employment Opportunity Act prohibits discrimination. on the 
grounds of race, color, religion, sex, or national origin in selection procedures 
leading to employment decisions. These regulations apply to employers (both pri- 
vate and governmental), labor organizations, employment agencies, and licensing 
and certification boards. When the use of a test or other selection procedure re- 
sults in a substantially higher rejection rate for minority candidates than for non- 
minority candidates (“adverse impact"), "'? its utility must be justified by evidence 
of validity for the job in question. 

Historically, the requirements for acceptable test validation have been set 
through the Testing Standards, the Principles for the Validation and Use of Personnel 
Selection Procedures (SIOP, 1987), and other such documents from within the pro- 
fession. However, in the past two decades, there have been several cases in which 
extraneous legal considerations have intruded into psychometric practices, espe- 
cially in connection with civil rights. One of these cases is the agreement known 
as the “Golden Rule” (also see chap. 7). This agreement settled a dispute between 
the Golden Rule Insurance Company and the Educational Testing Service 
(ETS), concerning examinations prepared by ETS for the licensure of insurance 
agents. The settlement directed that priority be given to the use of test items with 
the smallest intergroup differences when differential rates of correct responding 
for majority and minority groups are present. Although it was intended to pro- 
mote fairness and minimize adverse impact, the Golden Rule settlement has 
prompted a heated debate concerning its assumptions about the nature of item 
bias and concerning the extent to which empirical evidence justifies the proce- 
dure the settlement suggests (APA, CPTA, 1988; Bond, 1987; Linn & Drasgow, 
1987; Rooney, 1987). 

In the discussion of affirmative action, the 1978 Uniform Guidelines point out 
that even when selection procedures have been satisfactorily validated, if dispro- 
portionate rejection rates result for minorities, steps should be taken to reduce 
this discrepancy as much as possible. Affirmative action implies that an organiza- 

. tion does more than merely avoid discriminatory practices. Psychologically, affir- 
mative action programs—which have come under increasing attack in the 
political arena in the past few years—may be regarded as efforts to compensate for 
the residual effects of past social inequities. The practice of subgroup norming, 


"The Uniform Guidelines have become dated and are clearly in need of review and revision. A revised version of 
the Guidelines may follow after the publication of the new Testing Standards, which is expected in the late 1990s 
(see chap. 1). 

"The inconsistencies in ways of assessing adverse impact in different court cases were analyzed by Lerner 
(1980a; see also Ironson, Guion, & Ostrander, 1982). 
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implemented with the GATB in the 1980s in order to produce comparable job re- 
ferral rates for White, Black, and Hispanic applicants in spite of the large discrep- 
ancies in their aptitude test scores (chap.17; also see Hartigan & Wigdor, 1989), 
was an example of affirmative action meant to reduce the adverse impact of a pre- 
employment test. This practice, however, generated such controversy that it led 
to the passage of the Civil Rights Act of 1991 (BL. 102-166), explicitly banning 
any form of score adjustment based on race, color, religion, sex, or national origin. 
In the realm of psychological testing, it is recognized that the ramifications of 
this Act "are more far-reaching than Congress envisioned" (D. C. Brown, 1994, 
p. 921) and could severely restrain the use of personality and physical ability tests 
that use separate norms for men and women (see, also, L.S. Gottfredson, 1994; 
Kehoe & Tenopyr, 1994; Sackett & Wilk, 1994). In fact, some test authors and 
publishers have already taken steps to provide alternative scoring procedures that 
eliminate the separation of norms by gender (see, e.g., Gough & Bradley, 1996). 

Another well-intentioned effort to do away with barriers to open opportunity 
for all that has created concern on the part of employers and others who are inter- 
ested in proper. preemployment testing practices is the Americans with Dis- 
abilities Act (ADA) of 1990 (P.L. 101-336). The ADA’s employment provisions 
prevent employers from, using medical tests or inquiring about substance abuse 
history or past psychiatric conditions before a job offer is made. The EEOC guide- 
lines and regulations on preemployment disability-related inquiries and medical 
examinations (1994, 1995) have so far left open the question of which psycholog- 
ical and personality tests are permissible in preemployment situations. 

Inconsistencies between professional, legal, and ethical injunctions are likely 
to keep emerging in the future (see, e.g. D. C. Brown, 1996). They will undoubt- 
edly make it more difficult to apply tests to decision making in the so-called “high 
stakes” areas of employment and education. To a large extent, this situation repre- 
sents an advance because it underscores the need for an explicit recognition that 
values are involved in all decision making of any consequence, whether in the 
scientific or in the practical arenas. In the words of Messick, “values are intzinsic 
to the meaning and outcomes of the testing and have always been. . . . This makes 
explicit what has been latent all along, namely, that validity judgments are value 
judgments" (1995, p. 148). Yet, even well-meaning and reasonable people can, 
and do, disagree sharply about values. Therein lies the difficulty. 


Test-Related Factors. In testing diverse persons, it is important to differenti- 
ate between factors that affect both test and criterion behavior and those whose 
influence is restricted to the test. It is the latter, test-related factors, that reduce 
validity. Examples of such factors include previous experience in taking tests, 
motivation to perform well on tests, rapport with the examiner, undue emphasis 
on speed, and any other variables affecting performance on the particular test but 
irrelevant to the broad behavior domain under consideration. Special efforts 
should be made to reduce the operation of these test-related factors when testing 
persons with dissimilar cultural backgrounds or persons with disabilities (see 
Sattler, 1988, chaps. 19 & 20). Adequate test-taking orientation procedures, 
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preliminary practice, and other such steps devised for this purpose are highly 
desirable (chaps. 1 and 9). 

Specific test content may also influence test scores in ways that are unrelated 
to the ability the test is designed to assess. In a test of arithmetic reasoning, for ex- 
ample, the use of names or pictures of objects unfamiliar in a particular cultural 
milieu would represent a test-restricted handicap. Another, more subtle way in 
which specific test content may spuriously affect performance is through the test 
taker’s emotional and attitudinal responses. Stories or pictures portraying typical 
suburban middle-class family scenes, for example, may alienate a child reared in a 
low-income inner-city home. The perpetuation of sex stereotypes in test content, 
such as in the portrayal of male doctors or pilots and female nurses or flight atten- 
dants, may also have detrimental effects. In light of these considerations, most 
test publishers now make special efforts to eliminate inappropriate test content. 
In fact, the review of test content with reference to possible negative implications 
for minority test takers is a common step in the test construction process (see, 
e.g., ETS Standards, 1981/1987). 

The testing of persons with diverse cultural backgrounds and experiential his- 
tories, as well as of individuals with disabilities, is a pervasive concern throughout 
the Testing Standards, This general orientation is reflected in several individual 
standards for test development and test usage. In addition, special chapters, with 
their own sets of standards, deal with issues in the testing of people with handi- 
capping conditions and linguistic differences, which represent sizable segments of 
the U.S. population. 


Interpretation and Use of Test Scores. By far the most important consid- 
erations in the testing of diverse groups—as in all testing—pertain to the inter- 
pretation of test scores. The most frequent misgivings regarding the use of tests 
with minority group members stem from misinterpretation of scores. If a minority 
test taker obtains a low score on an aptitude test or a deviant score on a 
personality test, it is essential to investigate why he or she did so. For example, an 
inferior score on an arithmetic test could result from low test-taking motivation, 
poor reading ability, or inadequate knowledge of arithmetic, among other reasons. 
Some thought should also be given to the type of norms to be employed in 
evaluating individual scores.!? 

Tests are designed to show what an individual can do at a given point in time. 
They cannot tell us why she performs as she does. To answer that question we 
need to investigate her background, motivations, and other pertinent circum- 
stances. Nor can tests tell how able a culturally or educationally disadvantaged 
child might have been if he had been reared in a more favorable environment. 
Moreover, tests cannot compensate for cultural deprivation by eliminating its ef- 
fects from their scores. On the contrary, tests should reveal such effects so that ap- 
propriate remedial steps can be taken. To conceal the effects of cultural 


134 special section of the December 1994 issue of Psychological Assessment was devoted to providing information 
and guidance on various aspects of normative assessment. 
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disadvantages by rejecting tests oF by trying to devise tests that are insensitive to 
such effects can only retard progress toward a genuine solution of social problems. 

The tendency to categorize and label, as a shortcut substitute for understand- 
ing, is still all too prevalent. The diagnostic categories of classical psychiatry, 
whereby patients were assigned such labels as "paranoid schizophrenic” or 
“manic-depressive,” are a well-known example of this tendency. Aware of the 
many deficiencies of such a system of classification, authors of the more recent di- 
agnostic manuals of psychiatry describe disorders of various types and attach the 
labels to the conditions rather than to the individuals who suffer from them (see, 
e.g, American Psychiatric Association, 1994). Moreover, psychologists them- 
selves have turned increasingly to personality descriptions. Unlike the diagnostic 
labels, these descriptions focus on the origins and individual significance of de- 
viant behavior and provide a more effective basis for therapy. But the traditional 
labels are not easily dislodged. 

Another example of the categorizing tendency is provided by misinterpreta- 
tions of the IQ. According to a popular misconception, the IQ is an index of in- 
nate intellectual potential and represents a fixed property of the organism. As was 
seen in chapter 12, this view is neither theoretically defensible nor supported by 
empirical data. When properly interpreted, intelligence test scores should not fos- 
ter a rigid categorizing of persons: On the contrary, intelligence tests—and any 
other test—may be regarded as a map on which the individual's present position 
can be located. When ‘combined with information about experiential back- 
ground, test scores should facilitate effective planning for the optimal develop- 
ment of the individual. 


Objectivity of Tests. When social stereotypes and prejudice may distort 
interpersonal evaluations, tests provide a safeguard against favoritism and arbi- 
trary and capricious decisions. As the civil rights movement gained momentum, 
several observers called attention to the positive function that standardized 
testing can serve. Commenting on the use of tests in schools, Gardner (1961, pp- 
48-49) wrote: “The tests couldn't see whether the youngster_was in rags or in 
tweeds, and they couldn’t hear the accents of the slum. The tests revealed 
intellectual gifts at every level of the population.” 

If tests were to be abolished, the need for making choices, by individuals as 
well as by organizations, would remain. Decision making would have to fall back 
on such long-familiar alternatives as letters of recommendation, interviews, and 
grade-point averages. Today these alternative data sources are often used in con- 
junction with test scores, but not in place of tests- In fact, standardized tests were 
introduced as one means of compensating for the unreliability, subjectivity, and 
potential bias of these traditional procedures. These alternati 
generally proved to be less accurate than tests in predicting school or job perfor- 
mance (Wigdor & Garner, 1982, Pt. I, chap. 1). More recently developed alterna- 
tive procedures, such as performance and portfolio assessment techniques, 
eventually may prove to have some advantages when compared with traditional 
tests. Thus far, however, research with these techniques suggests that they are 
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neither more valid nor more equitable across diverse populations than the stan- 
dardized tests they would supplement or replace (see chap. 17). 

Attacks on testing often fail to differentiate between the positive contributions 
of testing to fairness in decision making and the misuses of tests as shortcut substi- 
tutes for carefully considered judgment. Viewing testing in its social context, the 
Committee on Ability Testing (Wigdor & Garner, 1982, Pr. I) urged that tests be 
regarded neither as panaceas nor as scapegoats for society's problems and that so- 
cietal goals to increase opportunities for members of designated minority groups 
should not be confounded with the validity of the testing process. In a concluding 
statement, the committee observed, “The quest for a more equitable society has 
placed ability testing at the center of controversy and has given it an exaggerated 
reputation for good and for harm" (p. 239). This statement still is true and—in 
light of the dearth of viable alternatives—will most likely continue to be true for 
a long time to come. 

In summary, tests can indeed be misused with minorities, as with anyone else. 
Nevertheless, when properly used, they serve an important function in prevent- 
ing irrelevant and unfair discrimination. When evaluating the social conse- 
quences of testing, we need to assess carefully the social consequences of not 
testing and thus having to rely on other procedures for decision making that are 
less uniformly fair than testing. Furthermore, in determining the consequences of 
testing, we must be careful to differentiate the consequences of the proper use of 
tests from the consequences of their misuse and to separate the direct conse- 
quences of testing from those that are mediated by factors extraneous to the test- 
ing (Tenopyr, 1995). Otherwise, we are likely—entirely for the wrong reasons— 
to discard a tool that, albeit always in need of improvement, may prove to be 
irreplaceable. 
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Appendix A 


Alphabetical Listing of Test 
and Other Assessment Tools 


This table lists all tests discussed or mentioned in the text, except for: (a) out-of- 
print tests cited for historical reasons only; (b) tests that are not yet published; 
and (c) tests described in the literature and available only from their authors. For 
further information, see the Mental Measurements Yearbooks and other publica- 
tions from the Buros Institute of Mental Measurements, Test Critiques (Keyser & 
Sweetland, 1984-1994), as well as other sources cited in chapter 1. 


TEST NAME (ABBREVIATION) /PUBLISHER'S CODE* 


AAMR Adaptive Behavior Scale (ABS) / PRO-ED A 

ACT Assessment / ACT d 

Adjective Check List (ACL) | CPP 

Alcohol Use Inventory (AUI) / NCS 

Armed Forces Qualification Test (AFQT) / U.S. Military 

Armed Services Vocational Aptitude Battery (ASVAB) / U.S. Military 
Basic Personality Inventory (BPI) / Sigma 

Bayley Infant Neurodevelopmental Screener (BINS) / TPC 

Bayley Scales of Infant Development—Second Edition (Bayley-ll) | TPC 
Beck Depression Inventory (BDI) / TPC 

Behavior Assessment System for Children (BASC) / AGS 

Bender Visual Motor Gestalt Test (Bender-Gestalt) | WPS 

Bennett Mechanical Comprehension Test (BMCT) / TPC 


Boehm Test of Basic Concepts—Revised (Boehm-R) / TPC 
Bracken Basic Concept Scale (BBCS) / TPC 


1 Appendix B contains the complete names and addresses of the test publishers listed here; arranged in alphabet- 


ical order according to the Publishers' codes used in this Appendix. 
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Brief Symptom Inventory (BSI) / NCS 

British Ability Scales (BAS) / NFER-Nelson 

Bruininks-Oseretsky Test of Motor Proficiency / AGS 

California Achievement Tests—Fifth Edition (CAT) / CTB 

California Child Q-Set / CPP 

California Diagnostic Mathematics Tests (CDMT) [| CTB 

California Diagnostic Reading Tests (CDRT) [| CTB 

California Psychological Inventory— Third Edition (CPI-3) / CPP 
California Q-Sort Deck / CPP 

Campbell Interest and Skill Survey (CISS) / NCS 

Career Assessment Inventory— The Enhanced Version (CAI-EV) | NCS 
Career Assessment Inventory—The Vocational Version (CALVV) / NCS 
Career Beliefs Inventory (CBI) / CPP 

Career Development Inventory (CDI) | CPP 

Career Directions Inventory (CDI) / Sigma 

Career Interest Inventory (CII) / TPC 

Career Planning Program (CPP) / ACT 

Children's Apperception Test (C.A.T.) / CPS 

Cognitive Abilities Test (CogAT, Form 5) / Riverside 

College Level Examination Program (CLEP) / ETS 

Columbia Mental Maturity Scale (CMMS) / TPC 

Comprehensive Tests of Basic Skills—Fourth Edition (CTBS/4) / CTB 
Computer Programmer Aptitude Battery (CPAB) | SRA 

Concept Assessment Kit—Conservation (CAK) / EdITS 

Crawford Small Parts Dexterity Test (CSPDT) / TPC 

Cross-Cultural Adaptability Inventory (CCAI) / NCS 

CRT Skills Test / SRA 

Culture Fair Intelligence Test/IPAT 

Das-Naglieri Cognitive Assessment System (CAS) / Riverside 
Differential Ability Scales (DAS) / TPC 

Differential Aptitude Tests— Computerized Adaptive Edition (DAT Adaptive) / TPC 
Differential Aptitude Tests—Fifth Edition (DAT) / TPC 

Draw-a-Man Test (see Goodenough-Harris Drawing Test) 

Edwards Personal Preference Schedule (EPPS) / TPC 

Embedded Figures Test (EFT) / CPP 

Fagan Test of Infant Intelligence / Infantest 

Fleishman Job Analysis Survey (F-JAS) / MRI 

General Aptitude Test Battery (GATB) / USES 

Goodenough-Harris Drawing Test / TPC 

Graduate Record Examinations (GRE) / ETS 

Group Embedded Figures Test / CPP 

Guide to the Assessment of Test Session Behavior for the WISC-III and the WIAT [| TPC 
Guilford-Zimmerman Temperament Survey (GZTS) / CPP 
Halstead-Reitan Neuropsychological Test Battery (HRB) / RNL 

Haptic Intelligence Scale / Stoelting 

Harrington-O'Shea Career Decision-Making System—Revised (CDM-R) / AGS 
Hogan Personality Inventory—Second Edition (HPI) / HAS 
House-Tree-Person (H-T-P) / WPS 

Infant-Toddler Developmental Assessment (IDA) / Riverside 


Appendixes 567 


Iowa Tests of Basic Skills | Riverside 

lowa Tests of Educational Development / Riverside 

Jackson Personality Inventory—Revised (JPLR) / Sigma 

Jackson Vocational Interest Survey (JVIS) / Sigma 

Kaufman Adolescent and Adult Intelligence Test (KAIT) / AGS 
Kaufman Assessment Battery for Children (K-ABC) / AGS 

Kaufman Brief Intelligence Test (K-BIT)/ AGS 

Kaufman Test of Educational Achievement (K-TEA) / AGS 

Kuder General Interest Survey (KGIS) / CTB 

Kuder Occupational Interest Survey (KOIS) / CTB 

Kuder Preference Record— Vocational (KPR-V) / CTB 

Leiter International Performance Scale— Revised (LIPS-R) / Stoelting 
Luria-Nebraska Neuropsychological Battery (LNNB)/ WPS 
Machover Draw-a-Person Test (D-A-P) / Thomas 

McCarthy Scales of Children’s Abilities (MSCA) / TPC 
Metropolitan Achievement Test—Seventh Edition (MAT) / TPC 
Metropolitan Readiness Tests—Sixth Edition. (MRT) / TPC 
MicroCog: Assessment of Cognitive Functioning i TRE 

Millon Adolescent Clinical Inventory (MACI) / NCS 

Millon Adolescent Personality Inventory (MAPI) [| NCS 

Millon Clinical Multiaxial Inventory-lll (MCMLII) | NCS 

Millon Index of Personality Styles (MIPS) / TPC 

Minnesota Clerical Test (MCT) / TPC 

Minnesota Multiphasic Personality Inventory—Adolescent (MMPI-A) / UMP 
Minnesota Multiphasic Personality Inventory—2 (MMPI-2) | UMP 
Multidimensional Aptitude Battery (MAB) / Sigma 

Myers-Briggs Type Indicator (MBTI) / CPP 

Occupational Aptitude Survey and Interest Schedule— Second Ed. (OASIS-2) / PRO-ED 
Ordinal Scales of Psychological Development / UIP 

Orleans-Hanna Algebra Prognosis Test / TPC 

Otis-Lennon School Ability Test—Seventh Edition (OLSAT7) / TPC 
Paced Auditory Serial Addition Test (PASAT) / ForThought 

PDI Customer Service Inventory | PDI 

PDI Employment Inventory / PDI 

Peabody Picture Vocabulary Test—Revised (PPVT-R) / AGS 
Personality Assessment Inventory (PAI) / PAR ` 

Personality Inventory for Children—Revised (PIC-R) / WPS 
Personality Inventory for Youth (PIY) / WPS 

Personality Research Form (PRF) / Sigma 


Quality of Life Inventory (QOLI) | NCS 

Raven’s Progressive Matrices (RPM) / Oxford (U.S. Distributor: TPC) 
Revised Minnesota Paper Form Board Test (RMPFBT) / TPC 

Revised NEO Personality Inventory (NEO PI-R)/ PAR 

Roberts Apperception Test for Children (RATC) / WPS 

Rorschach / H & H 

Rotter Incomplete Sentences Blank (ISB) / TPC 

Schaie-Thurstone Adult Mental Abilities Test / CPP 

Schedule for Nonadaptive and Adaptive Personality (SNAP) / UMP 
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Scholastic Assessment Test (SAT) / ETS 

Self-Directed Search (SDS) / PAR 

Senior Apperception Test (S.A.T.) / CPS 

Sixteen Personality Factor Questionnaire—Fifth Edition (16PF) /IPAT 
Social Climate Scales / CPP 

Social Skills Rating System (SSRS) / AGS 

Stanford Achievement Test—Eighth Edition / TPC 

Stanford-Binet Intelligence Scale—Fourth Edition (SB-IV) / Riverside 
Stanford Diagnostic Mathematics Test—Third Edition (SDMT) / TPC 
Stanford Diagnostic Reading Test—Third Edition (SDRT) / TPC 
Stanford Writing Assessment Program / TPC 

State-Trait Anger Expression Inventory (STAXI) / PAR 

State-Trait Anxiety Inventory (STAI) / CPP 

State-Trait Anxiety Inventory for Children (STAIC) / CPP 

Strong Interest Inventory (SII) / CPP 

Structure of Intellect Learning Abilities Test / WPS 

Student Adaptation to College Questionnaire (SACQ) / WPS 

Student Self-Concept Scale (SSCS) / AGS 

Student Styles Questionnaire / TPC 

Symptom Checklist-90—Revised (SCL-90-R) / NCS 

System for Interactive Guidance Information—Revised (SIGI-PLUS) / ETS 
Teamwork-KSA / SRA 

Tell-Me-A-Story (TEMAS) / WPS 

TerraNova series / CTB 

Test Anxiety Inventory (TAI) / CPP 

Test of Cognitive Skills—Second Edition (TCS/2) / CTB 

Test of Nonverbal Intelligence—Second Edition (TONI-2) / PRO-ED 
Test of Variables of Attention (T.O.V.A) / UAD 

Tests of Achievement and Proficiency / Riverside 

Tests of Adult Basic Education (TABE) / CTB 

The Holtzman Inkblot Technique (HIT) / TPC 

Thematic Apperception Test (TAT) / Harvard 

The Rosenzweig Picture-Frustration Study (P-F Study) / PAR 

The Scenotest / H & H 

The Values Scale / CPP 

Vigil Continuous Performance Test (VIGIL) / ForThought 

Vineland Adaptive Behavior Scales (VABS) / AGS 

Washington University Sentence Completion Test (WUSCT) / Erlbaum 
Wechsler Adult Intelligence Scale—Revised (WAIS-R) / TPC 
Wechsler Individual Achievement Test (WIAT) / TPC 

Wechsler Intelligence Scale for Children— Third Edition (WISC-III) / TPC 
Wechsler Preschool and Primary Scale of Intelligence—Revised (WPPSI-R) / TPC 
Wonderlic Personnel Test / Wonderlic 

Woodcock-Johnson Psycho-Educational Battery—Revised (W]-R) / Riverside 
Woodcock Reading Mastery Tests—Revised / AGS 

Work Keys / ACT 
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AAMR 


ABPP 


ACA 


ACT 


AERA 


AGS 


Addresses of Test 
Publishers. Distributors, and 
Test-Related Organizations 


e — American Association on Mental Retardation 
444 North Capitol Street, N.W., Suite 846 
Washington, DC 20001-1512 


e — American Board of Professional Psychology 
2100 East Broadway, Suite 313 
Columbia, MO 65201-6082 


e American Counseling Association 
5999 Stevenson Avenue 
Alexandria, VÀ 22304-3300 


e American College Testing Program 
ACT National Office 
2201 North Dodge Street 
P.O. Box 168 
Iowa City, [A 52243-0168 


e — American Educational Research Association 
1230 Seventeenth Street, N.W. 
Washington, DC 20036-3078 


e — American Guidance Service, Inc. 
4201 Woodland Road 
Circle Pines, MN 55014-1796 
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APA 


ASC 


Buros 


CEEB 


CPP 


CPS 


EdITS 


EEOC 


American Psychological Association 
750 First Street, N.E. 
Washington, DC 20002-4242 


Assessment Systems Corporation 
2233 University Avenue, Suite 200 
St. Paul, MN 55114 


Association of Test Publishers 
655 Fifteenth Street, N.W., Suite 320 
Washington, DC 20005 


Board on Testing and Assessment 
National Research Council 

2101 Constitution Avenue, N.W. 
Washington, DC 20418 


Buros Institute of Mental Measurements 
P.O. Box 880348 

135 Bancroft Hall 

Lincoln, NE 68588-0348 


College Entrance Examination Board 
45 Columbus Avenue 
New York, NY 10023-6992 


Consulting Psychologists Press, Inc. 
3803 East Bayshore Road 

PO. Box 10096 

Palo Alto, CA 94303 


C.PS., Inc. 
P.O. Box 83 
Larchmont, NY 10538 


CTB/McGraw-Hill 
20 Ryan Ranch Road 
Monterey, CA 93940 


Educational and Industrial Testing Service 
P.O. Box 7234 
San Diego, CA 92167 


Equal Employment Opportunity Commission 
1801 L Street 
Washington, DC 20507 


Erlbaum 


ETS 


ForThought 


GRE 


Harcourt Brace 


Harvard 


HAS 


H&H 


Appendixes 
Lawrence Erlbaum Associates, Inc. 
10 Industrial Avenue 
Mahwah, N] 07430-2262 


Educational Testing Service 
Publications Order Services 
P.O. Box 6736 

Princeton, NJ 08541-6736 


ETS Test Collection 
Mailstop 30-B 

Rosedale Road 

Princeton, NJ 08541-0001 


ForThought, Ltd. 
Nine Trafalgar Square 
Nashua, NH 03063 


Graduate Record Examinations 
Educational Testing Service 
P.O. Box 6000 

Princeton, NJ 08541-6000 


Harcourt Brace Educational Measurement 
Educational Testing Division of "PC 

555 Academic Court ` - 

San Antonio, TX 18204-2498 í 


Harvard University Press 
79 Garden Street ` 
Cambridge, MA 02138 


Hogan Assessment Systems, Inc. 
P.O. Box 521176 
Tulsa, OK 74152 


Hogrefe & Huber Publishers 


United States Office: 
P.O. Box 2487 isa 
Kirkland, WA 98083. 


Swiss Office: 
Verlag Hans Huber 
Langgass-Strasse 76 
CH-3000 Bern 9 
Switzerland 
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IBM 


Infantest 


IPAT 


MRI 


NCME 


NCS 


NFER-Nelson 


Oxford 


PAR 


PDI 


PRO-ED 


x 


IBM K-12 Education 
4111 Northside Parkway 
Atlanta, GA 30327 


Infantest Corporation 
P.O. Box 18765 
Cleveland Heights, OH 44118-0765 


Institute for Personality and Ability Testing, Inc. 
P.O. Box 1188 
Champaign, IL 61824-1188 


Management Research Institute, Inc. 
6701 Democracy Blvd., Suite 300 
Bethesda, MD 20817 


National Council on Measurement in Education 
1230 Seventeenth Street, N.W. 
Washington, DC 20036 


National Computer Systems, Inc. 
PO. Box 1416 
Minneapolis, MN 55440 


NFER-Nelson Publishing Company, Ltd. 
Darville House, 2 Oxford Road East 
Windsor-Berkshire, SL4 1DF 

United Kingdom 


Oxford Psychologists Press, Ltd. 
Lambourne House 311 

321 Banbury Road 

Oxford OX2 7]H 

England 


Psychological Assessment Resources, Inc. 
P.O. Box 998 
Odessa, FL 33556-0998 


Personnel Decisions International 
2000 Plaza VII 

45 South Seventh Street 
Minneapolis, MN 55402-1608 


PRO-ED ; 
8700 Shoal Creek Boulevard 
Austin, TX 78757-6897 


Riverside 


The Score 


Sigma 


SilverPlatter 


SIOP 


SPA 


SRA 


Stoelting 
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The Riverside Publishing Company 
425 Spring Lake Drive 
Itasca, IL 60143 


Reitan Neuropsychological Laboratory 
2920 South Fourth Avenue 
Tucson, AZ 85713-4819 


Newsletter for Division 5 of the American Psychological 
Association 

4201 Woodland Road 

Circle Pines, MN 55014 


Sigma Assessment Systems, Inc. 


United States Office: 
1110 Military Street 


:P.O. Box 610984 


Port Huron, MI 48061-0984 


Canadian Office: 

Research Psychologists Press, Inc. 
650 Waterloo Street, Suite 100 
PO. Box 3292, Station B 
London, ON N6A 4K3 


SilverPlatter Information 
100 River Ridge Drive me 
Norwood, MA 02062-5045 - 


Society for des and Organizational Psychology, Inc. 
P.O. Box 87 ` : : 
Bowling Green, (OH 43402 


Society for Personality Anbaadent 
150 First Street, N.E. 
Washington, DC 20002-4242 


McGraw-Hill/London House 

SRA Business and Industry Assessments _ 
9701 West Higgins Road 

Rosemont, IL 60018-4720 


Stoelting Company 
620 Wheat Lane 
Wood Dale, IL 60191 
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Swets 


Thomas 


TPC 


UAD 


UIP 


UMP 


USES 


U.S. Military 


Wonderlic 


Swets Test Services 
Heereweg 347b 
2161 CA Lisse 
Nederland 


Charles C. Thomas Publisher 
2600 South First Street 
Springfield, IL 62794-9265 


The Psychological Corporation 
555 Academic Court 
San Antonio, TX 78204-2498 


Universal Attention Disorders, Inc. 
4281 Katella #215 
Los Alamitos, CA 90720 


University of Illinois Press 
1325 South Osk Street 
Champaign, IL 61820 


University of Minnesota Press 

Test Division 

111 Third Avenue South, Suite 290 
Minneapolis, MN 55401 


United States Employment Service 

Western Assessment Research and Development Center 
140 East 300 South 

Salt Lake City, UT 84111 


United States Military Entrance Processing Command 
Attn.: Technical Directorate 

2500 Green Bay Road 

North Chicago, IL 60064-3094 


Wonderlic Personnel Test, Inc. 
1509 North Milwaukee Avenue 
Libertyville, IL 60048-1380 


Western Psychological Services 
12031 Wilshire Boulevard 
Los Angeles, CA 90025-1251 
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Experimental psychology, influences on early 
test development, 48 


Face validity, 131-132 
Factor analysis, 55, 302 
cognitive task analysis and, 334 
composition of a test, 321—322 
construct validity and, 142-143 
factor loadings and correlation, 322-323 
factor matrix, 317-318 
of intelligence, 317-323 
interpretation of, 320-321 
oblique axes and second-order factors, 323 
teferences axes, 318-320 
self-report personality inventories and, 
376-381 
Factorial validity of a test, 142, 320 
Factor matrix, 317-318 
Fagan Test of Infant Intelligence, 259, 566 
Fair use of tests, decision models for, 184—185 
Faking 
projective techniques and, 447-448 
self-report personality inventories and, 
388-390 
False acceptances, 158 
False rejections, 159 
Feedback, test scores affected by, 33 
Fels Research Institute, 339, 340 
Field dependence, 460-462 
Finding Information, 42 
Five-Factor Model (FFM), 378-381, 520 
Fixed reference group, 85-87 
Fleishman Job Analysis Survey (F-JAS), 508, 
566 
Flexibility of closure, 459 
Forced-choice techniques, 390-391 
Forensic psychology, 523n 
Free association test, 58, 439-440 
Frequency distribution, 63-65 


Frequency polygon, 64 
Functional literacy, 497n 


g factor, 324-325 
Gender-related issues, 421-423 
General Aptitude Test Battery (GATB), 40, 
173, 512—513, 539, 566 
General factor, 324 
Generalizability theory of reliability, 99n 
Generalized linear item-response theory 
(GLIRT), 206n 
Gerontological Apperception Test, 437 
Gesell Development Schedules, 71, 250-251 
Golden Rule case, 213-214, 560 
Goodenough-Draw-a-Man Test, 33, 278-280 
Goodenough-Harris Drawing Test, 278-280, 
444, 566 
Gottschaldt Figures, 460 
Grade norms, 70 
Graduate Record examination Board, 38, 
39-40, 501 
Graduate Record Examinations (GRE), 501, 
566, 571 
Group(s) 
item discrimination and extreme, 196-197 
item discrimination and small, 197—199 
validity and contrasted, 136 
Group Embedded Figures Test, 460, 461, 462, 
566 
Group factor, 325 
Group testing 
adaptive testing, 288-292 
advantages of, 287 
disadvantages of, 287-288 
measuring multiple aptitudes, 301—307 
multilevel batteries, 292-301 
origin, 152—153 
role of, 52-53 
versus individual testing, 286—288 
Guided assessment, 503 
Guides to the Assessment of Test Session Be- 
havior of the WISC-III and the WIAT, 
521, 566 
Guilford-Zimmerman Temperament Survey, 
311, 566 
Guttman scale, 71n, 91n, 420 


Halo effect, 480 

Halstead-Reitan Neuropsychological Test 
Battery (HRB), 531, 566 

Haptic Intelligence Scale, 271n, 566 

Harrington-O'Shea Career Decision-Making 
System-Revised (CDM-R), 410, 540, 566 

Hayes-Binet, 270 P 


Head Start, 339, 344 

Hearing impaired, testing of, 268-269, 566 

Heritability of intelligence, 311-314 

Heteroscedasticity, 157 

Hierarchical theories, 330-332 

Hiskey-Nebraska Test of Learning Aptitude, 
269 

Histograms, 64 

Hogan Personality Inventory (HPI), 522, 566 

Holtzman Inkblot Technique (HIT), 432—433, 
568 

Home Observation for Measurement of the 
Environment (HOME), 284 

Homogeneity of test items, 111-112 

Hopkins Symptom Checklist, 364 

House-Tree-Person, 444, 566 

How to Take a Test: Doing Your Best (Dobbin), 


40 
Human figure drawings (HFDs), 444 
Human Intelligence International Newsletter, 313 


IBM, 504n, 572 
LE Scale (Internal-External Scale), 422-423 
Illusory validation, 452 
Impression management, 389 
In-basket test, 507 
Incremental validity, 162 
Index of discrimination, 199-200 
Index of reliability, 114 
Individual with Disabilities Education Act 
(IDEA), 261, 266-267 
Individual testing, group testing versus, 
286-288 
Industrial use of tests, 504-523 
Infant and preschool testing. See also Child- 
hood, early 
administration of, 249-250 
Bayley Scales of Infant Development, 251-253 
current trends in, 258-261 
historical background of, 250-251 
McCarthy Scales of Children’s Abilities 
(MSCA), 253-254 
Piagetian scales, 254-258 
predictive validity of, 342 
types of tests, 250 
Infant Psychological Development Scales, 
255-256, 566 
Infant-Toddler Development Assessment 
(IDA), 259-260 
Inferential statistics, 67n 
Information Bulletin (Graduate Record Exami- 
nation Board), 39 
Informed consent, 555 


Inkblot techniques 


Index 607 


alternative approaches, 430—432 
Exner’s Comprehensive System, 428-430 
Holtzman, 432-433 
» Rorschach, 19, 209, 425-428 
Instability of intelligence test performance, rea- 
sons for, 340-341 
Institute for Personality Assessment and 
Research (IPAR), 466, 471-472 
Integrity tests, 521 ; 
Intelligence 
defined, 308-309, 334-336, 511 
in early childhood, 341-344 
factor analysis of, 317-323 
heritability and modifiability, 311-314 
longitudinal studies of children's, 337-341 
motivation and, 314-316 
personality and, 316 
problems in testing adult, 345-351 
trait organization theories, 323-332 
traits, role of, 332-336 
use of academic, 511-512 
Intelligence quotient (IQ) 
deviation, 78-80 
first use of, 52 
meaning of, 309-311 
Intelligence Test of Visually Impaired Children 
(ITVIC), 271 
Intelligence test performance 
instability of, 340-341 
population changes in, 351-354 
stability of, 338-340 
Intelligence tests. See also under name of 
Binet, 51-52 
clinical and counseling testing and, 526-528 
group, 52-53 
use of, 219 
Interactive computer systems, 89 
Intercept bias, 182-184 
Interest tests 
Career Assessment Inventory-Vocational 
Version, 414 
current setting, 401—403 
Jackson Vocational Interest Survey, 411-412 
Kuder Occupational Interest Survey, 
412-413 
opinion surveys and attitude scales, 418-423 
overview of, 410-416 
Self-Directed Search, 414-416 
sex fairness and, 402 
Strong Interest Inventory, 403-410 
trends, 416-420 
Interitem consistency, measures of, 111-1 12 
Internal consistency, 143 


Internal-External Scale (LE Scale), 422-423 
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Interval scales, item difficulty and, 188 
Interviewing techniques, 478-479 
Iowa Tests of Basic Skills, 496, 567 
Iowa Tests of Educational Development, 496, 567 
Ipsative scores, 384—385 
IQ. See Intelligence quotient 
Item analysis 
cross-validation, 208-210 
current developments in, 214-216 
differential item functioning, 210-214 
item difficulty, 187-193 
item discrimination, 193-201 
item response theory, 87-88, 201-206 
role of, 186 
of speeded tests, 207-208 
Item characteristic curve (ICC) theory, 
203-205, 211 
Item difficulty 
distribution of test scores, 191—192 
interval scales, 188 
percentage passing, 187—188 
testing purpose and, 193 
Thurstone absolute scaling, 188—191 
Item discrimination 
biserial correlation, 201 
choice of criterion, 193-196 
extreme groups used in, 196-197 
index of discrimination, 199-200 
phi coefficient, 200-201 
small groups used in, 197-199 
statistical indices of, 196 
Item information functions, 205 
Item response theory (IRT), 87-88 
basic features, 203-205 
current status of, 206 
item-test regression, 201—202 
models, 205-206 
Item sampling, and test reliability, 108-111 
Item-test regression, 201-202 
Item writing, 498 


Jackson's Personality Inventory-Revised (JPI-R), 
387, 567 
Jackson's Basic Personality Inventory, 372, 565 
Jackson Vocational Interest Survey (JVIS), 
411-412, 567 
J-coefficient, 509 
Job analysis method, 507-508 
Job element method, 507-508 
Job performance 
criteria, 135-136, 509-510 
predicting, 508-509 
Job samples, 506 
Joint Service Job Performance 


Measurement/Enlistment Standards 
(JPM) project, 514-515 
Journal of Personality Assessment, 424, 449n 


Kaufman Adolescent and Adult Intelligence 
Test (KAIT), 238-239, 567 

Kaufman Assessment Battery for Children 
(K-ABC), 236-238, 250, 567 

Kaufman Brief Intelligence Test (K-BIT), 239, 
567 


Kaufman sales, development of, 236 

Kaufman Test of Educational Achievement 
(K-TEA), 534, 567 

Kent-Rosanoff Free Association Test, 440 

Kinetic Family Drawing (KFD), 444—445 

Kuder General Interest Survey (KGIS), 413, 567 

Kuder Occupational Interest Survey (KOIS), 
412-413, 567 

Kuder Preference Record-Vocational, 412—413, 
56 


7 
Kuder-Richardson reliability, 111—113 
Kuhlmann-Binet, 52 


Lambda coefficient, 413n 
Language, in transcultural testing, 357-359 
Latent trait models, 87 
Latent trait theory, 203-205 
Leaderless Group Discussion (LGD), 466 
Leadership, testing for, 521-522 
Learning disabilities (LD) 
assessment techniques, 533-535 
defined, 532 
dynamic assessment, 535-536 
identifying, 532-536 
Learning Plus program, 504n 
Learning-potential assessment, 535—536 
Legal regulations 
civil rights and testing, 558-561 
ethical issues and, 559-561 
learning disabilities and, 532-533 
physical disabilities and, 266-268 
Leiter International Performance Scale, 272 
Leiter International Performance Scale-Re- 
vised, 275-277, 567 
Leniency error, 480. 
Licensing of psychologists, 551—552 
Lie detector, 439-440 
Life-history data, 483-485 
Life Values Inventory, 401 
Likert-type scale, 420-421 
Linear transformations, 75-76 
Local norms, 85 
Locus of control, 422-423 
Longitudinal studies of children’s intelligence, 


337-341 
Luria-Nebraska Neuropsychological battery 
(LNNB), 531—532, 567 


Machover Draw-a-Person Test (D-A-P), 
443-444, 567 
Mastery testing 
domain-referenced testing and, 92-95 
reliability and, 126 
McCarthy Scales of Children’s Abilities 
(MSCA), 253-254, 567 
Mean, 65, 66 
Mean square deviation, 67 
Measures of Personality and Social Psychological 
Attitudes, 421 
Measures of styles and types 
cognitive styles, 458-462 
personality types, 462-464 
Mechanical aptitude tests, 517 
Median, 65, 66 
Mediated learning experience, 357 
Mental age, 51-52, 69-70 
Mental level, 51, 52 
Mental Measurements Yearbook (MMY), 41—42, 
218 
Mental retardation 
assessment of, 261—266 
early classification of, 47-48 
Mental test(s) 
early, 50-51 
first use of term, 50 
Meta-analysis, 139—140 
Metropolitan Achievement Test, 84, 567 
Metropolitan Readiness Tests, 504, 567 
MicroCAT, 291 
MicroCog: Assessment of Cognitive Func- 
tioning, 544, 567 
Millon Adolescent Clinical Inventory 
(MACI), 383—384, 567 
Millon Adolescent Personality Inventory 
(MAPI), 383—384, 567 
Millon Clinical Multiaxial Inventory-III 
(MCMIII), 381—384, 567 
Millon Index of Personality Styles (MIPS), 
384, 567 
Minimum competency tests in basic skills, 
497-498 
Minnesota Clerical Test (MCT), 518, 567 
Minnesota Multiphasic Personality Inventory 
(MMPI, MMPI-2, MMPI-A), 88, 
365-373, 567 
adolescent (MMPI-A), 371 
comments about, 371—373 
description of, 365-366 
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revision (MMPI-2), 366-371 
Minnesota Paper Form Board Tests-Revised, 
517 
Minorities, testing of. See Multicultural testing 
Mode, 65, 66 
Moderator variables, 169—170 
Moderation, intelligence and, 314-316 
Motor impaired, testing of, 271-273 
Multicultural testing 
approaches to, 281-283 
description of problem, 273-275 
ethical issues, 558-564 
impact of environment on, 283-284 
others names for, 274n 
traditional methods, 275-280 
Multidimensional Aptitude Battery (MAB), 
304-307, 332, 567 
Multilevel batteries 
overview of, 292-293 
recognition of multiple aptitudes, 300-301 
representative, 293-294 
test content at different levels, 294—300 
Multiple aptitude batteries/tests, 55 
differential aptitude tests, 124, 302-304 
measuring, 301-307 
multidimensional aptitude tests, 304-307 
Multiple choice items, 286, 492 
Multiple correlation, 172 
Multiple discriminant function, 175-176 
Multiple-factor theories, 325-328 
Multiple regression equation, 171-173 
Multitrait-multimethod matrix, 144 
Myers-Briggs Type Indicator (MBTI), 391, 
463-464, 567 


Narrative computer interpretation, 88 

National Adult Literacy Survey, 497-498 

National anchor norms, 84-85 

National Assessment of Educational Process, 
90, 492 

National Council on Measurement in Educa- 
tion (NCME), 43, 572 

National References Scale, 84 

National Research Council, 550, 559 

Naturalistic observation, 477-478 

Navy Classification Battery, 33 

NEO Personality Inventory (NEO PLR), 
379-380 

Neuropsychological assessment, 528-532 

Nominating technique, 481-482 

Nomothetic span, 149 

Nonllanguage tests, role of, 248 

Nonlinear transformations, 76 

Nonreading tests, 248-249 
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Nonverbal tests, role of, 248-249 
Normal curve, 64-65, 192 
Normalized standard scores, 76-77 
Normal percentile charts, 74 
Norm-referenced testing, 93-94 
Norms 

developmental, 68-72 

grade, 70 

local, 85 

national anchor, 84-85 

projective techniques and, 448—449 

relativity of, 80-88 

role of, 20-21, 62 

statistical concepts, 63-68 

subgroup/specific, 85 

within-group, 72-80 


Objective questions on tests, 492 
Objectivity, in psychological tests, 21 
Oblique axes, 323 
Observer reports, 477-483 
Occupational Aptitude Pattern (OAP), 512-513 
Occupational Aptitude Survey and Interest 
Schedule-2 (OASIS-2), 410, 567 
Occupational testing. See also Career; Interest 
tests 
personality testing in the workplace, 519-523 
use of, 510-519 
validation of employment tests, 505—510. 
Occupational themes, 404—409, 412, 413, 414, 
415 
Office of Strategic Services (OSS), 
assessment-center program, 466 
On Your Own: Preparing for a Standardized Test, 
40 


Open-ended tasks, 493 

Opinion surveys and attitude scales, 418—422 

Ordet-of-merit procedures, 480 

Ordinal scales, 70—72 

Ordinal Scales of Psychological Development, 
255-256, 567 

Orleans-Hanna Algebra Prognosis Test, 502, 
567 

Orthogonal axes, 323 

Oseretsky Test of Motor Proficiency, 266 

Otis-Lennon School Ability Test (OLSAT), 
294, 295, 300, 496, 567 

Otis Self-Administering Tests of Metal Ability, 
286, 511 


Places Auditory Serial Addition Test (PASAT), 
545, 567 

Paper-and-pencil tests, 517—518, 521 

Pattern analysis. See Profile analysis 


PDI Customer Service Inventory, 522, 567 
PDI Employment Inventory, 522, 567 
Peabody Picture Vocabulary Test (PPVT-R), 
265, 272-273, 567 
Pearson Product-Moment Correlation Coeffi- 
cient, 102, 104, 155 
Peer assessments, 481—482 
Percentage passing, item difficulty and, 
187-188 
Percentiles, 72-75 
Perceptual functions, 459-460 
Performance techniques. See also Job performance 
drawing techniques, 443-445 
global, 506—507 
play and toy techniques, 445-446 
Performance tests, role of, 59, 135, 248 
Perkins-Binet Test of Intelligence for the Blind, 
270 
Personal and Academic Self-Concept Inventory 
(PASCI), 471 
Personal constructs, 474-475 
Personal Data Sheet, 58, 363, 364-365 
Personality, intelligence and, 316 
Personality Assessment Inventory (PAI), 
372-373, 567 
Personality inventories, self-report 
California Psychological Inventory, 373-374 
content-related procedures, 363-364 
current status of, 399 
empirical criterion keying, 364-376 
factor analysis, 376—381 
Minnesota Multiphasic Personality Inven- 
tory, 88, 365-373 
Personality Inventory for Children, 374-376 
personality theory, 381—388 
situational specificity, 393-398 
test-taking attitudes and response biases, 
388-393 
Personality Inventory for Children-Revised 
(PIC-R), 374-376, 482n, 567 
Personality Inventory for Youth (PIY), 376, 
482n, 567 
Personality Research From (PRF), 385-388, 
567 
Personality tests, 58-59 
in the workplace, 519-523 
Personality theory 
Edwards Personal Performance Schedule, 
384-385 
Millon Clinical Multiaxial Inventory, 
381-384 
Personality Research From, 385-388 
Personality types, 462—464 
Personnel selection, use of tests in, 158-178, 


504-523 
Phi coefficient, item discrimination and, 
200-201 
Physical disabilities, testing persons with 
hearing impairments, 268-269 
legislation and issues regarding, 266—268 
motor impairments, 271-273 
visual impairments, 270-271 
Piagetian ordinal scales, 141 
Piagetian scales, 254—258 
Pictorial classification tests, 273 
Pictorial techniques 
Children's Apperception Test, 436 
Gerontological Apperception Test, 437 
Roberts Apperception Test, 436 
Rosenzweig Picture-Frustration Study, 437-439 
Senior Apperception Test, 437 
TEMAS, 436-437 
Thematic Apperception Test, 433-437 
Picture vocabulary tests, 272 
Pintner-Paterson Performance Scale, 268 
Placement, use of term, 174 
Play and toy techniques, 445-446 
Populations 
intelligence test performance and changes in, 
351-354 
samples versus, 82-84 
Porteus Mazes, 272-567 
Portfolio assessment, 493 
Positive manifold, 319 
Power test, 116 
Practicing to Take the GRE General Test, 39 
Predictive validity in infant and preschool 
testing, 342 
Preschool testing. See Infant and preschool testing 
Preselection, 155-156 
Primary mental abilities, 325-327 
Principles for the Validation and Use of Personnel 
Selection Procedures, 549, 560 
Privacy issues, 554—555 
Probationary appointment, 506 
Productivity, validity and, 163, 165 
Profile analysis 
cutoff scores and, 173-174 
of intelligence test results, 526 
Profiles, examples 
aptitude test, 124-125 
personality test, 369-370 
Prognostic testing, 502-503 
Project A, 138, 515 
Project Head Start, 339 
Projective techniques 
autobiographical memories, 441-443 
as clinical tools, 455—456 
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evaluation of, 446—456 
inkblot techniques, 425—433 
performance techniques, 443-446 
pictorial techniques, 433-439 
as psychometric instruments, 454-455 
role of, 59, 424-425 
verbal techniques, 439-441 
Protocol analysis, 149, 334 
Psychiatric diagnosis, 137 
Psychological assessment, 524-525 
role of computers in, 543-546 
Psychological tests. See also under name of test 
administration of, 27-31 
capacity, 19 
defined, 18-23 
diagnostic/predictive value of, 18-19 
history of, 46-51 
misuses, 24-27, 561-563 
objectivity of, 21 
reasons for controlling the use of, 24-27 
reasons for using, 16-18, 488 
reliability of, 21-22 
sources of information, 41—43 
standardization, 20-21 
uses, 16-18 
validity of, 22-23 
varieties, 51-59 
Psychologists 
first experimental, 48 
qualifications and professional competence 
issues, 550-552 
Psychomotor skills and testing, 516-517 


Q sort, 472-473, 483 
Quality of Life Inventory (QOLI), 546, 567 


Range, 66 
Rapport 
establishing, 29-31 
projective techniques and, 447 
Rasch model, 205-206, 302 
Rater relevant contact, 480 
Ratings 
criterion measurements, 137 
errors, 480-481 
as an observation technique, 479-481 
Raven's Progressive Matrices (RPM), 277-278, 
325, 351, 567 
Raw scores, 20, 63 
Readiness tests, 503-504 
Reference axes in factor analysis, 318-320 
References groups, fixed, 85-87 
Regression equations, 17 1-173 
Rehabilitation Act (1973), 267 
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Reliability 
ability levels and, 121 
alternate-form, 107-109 
coefficient, 104-105 
cutoff scores and, 126 
defined, 21-22, 98 
Differential Ability Scales, 244-245 
index of, 114 
Kuder-Richardson, 111—113 
mastery testing and, 126 
overview of, 114—116 
protective techniques and, 449—450 
retest, 450 
scorer, 113-114 
speeded tests and, 116-118 
split-half, 109—111 
standard error of measurement, 121-125 
Stanford-Binet, 225-226 
test length and, 110 
test-retest, 105-107 
variability and, 119-121 
Wechsler Intelligence Scales, 233-234 
Report writing, in psychological assessment, 
542-543 
Rep Test,, 474-475 
Response capabilities, 511 
Response sets and styles, 391-393 
Responsibilities of Users of Standardized Tests 
(RUST statement), 549 
Retest reliability, 450 
Revised Minnesota Paper Form Board Test 
(RMPFBT), 517, 567 
Revised NEO Personality Inventory (NEO PI- 
R), 379-380, 567 
R-I-A-S-E-C scales, 417—418 
Roberts Apperception Test for Children 
(RATC), 436, 567 
Role Construct Repertory Test (Rep Test), 
414-415 
Roleplaying techniques, 467—468 
Rulon formula, 112n 
Rorschach, 19, 209, 425-428, 567 
comprehensive system, Exner's, 428-430 
consensus Rorschach, 431 
evaluation, 429-430, 431 
various approaches, 430-432 
Roan Picture-Frustration Study, 437-439, 
68 
Rotter Incomplete Sentences Blank (RISB), 
440-441, 567 


Sample-free scaling, 87-88 
Sample heterogeneity, 155 


Samples, versus populations, 82-84 
Sampling 
of behavior, 18-19. 
of content, 108 
of persons, 82-84 
Scatter, age scales, 69 
Scenotest, 445, 568 
Schaie-Thurstone Adult Mental Abilities Test, 
348n, 567 
Schedule for Nonadaptive and Adaptive Per- 
sonality (SNAP), 383n, 567 
Scholastic Aptitude Test (SAT), 26. See also 
Scholastic Assessment Test 
declining scores, 352-353 
impact of coaching on, 38 
renaming of, 86n, 489n 
Scholastic Assessment Test (SAT), 76, 568 
changes in, 489n, 499-500 
recentered score scale for, 499—500 
scale continuity in, 86 
for the visually impaired, 270 
School readiness, 503—504 
School Vista software, 504n 
Science Research Associates (SRA), 519 
Scorer reliability, 113-114 
Scores, types of, 68-80 
Screening, 174 
Seattle Longitudinal Study (SLS), 347-348 
Second-order factors, 323 
Seguin Form Board, 48 
Selected-response tasks, 493, 494, 495 
Selection, use of term, 174 
Selection ratio, 160-162 
Self-concepts and personal constructs, 468—477 
Adjective Check List, 471-472 
environmental perception, 475-477 
Q sort, 472-473 
Role Construct Repertory Test, 474—475 
self-esteem inventories, 469—471 
Semantic Differential, 473-474 
Washington University Sentence Comple- 
tion Test, 441n, 468-469 
Self-Directed Search (SDS), 414-416, 568 
Self-esteem inventories, 469-471 
Self-Esteem Scale (SES), 470 
Self-exploration, 401 
Self-fulfilling prophecy, 32 
Self-report, client, 538 
Self-report inventory, 58. See also Personality 
inventories, self-report 
Semantic Differential, 473-474 
Senior Apperception Test, 437, 568 
Sentence completion, 440-441 
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Sequential patterning, 141 
Sequential strategies, decision theory and, 
167-168 
Shrinkage, validity, 210 
Significance levels, 104 
Simple structure, 319 
Simulation tests, 506 
Situational specificity, self-report personality 
tests and, 393-398 
Situational tests 
assessment-center programs and roleplaying, 
466-468 
Character Education Inquiry, 465—466 
role of, 59, 464—465 
Sixteen Personality Factor Questionnaire 
(16PF), 377-378, 568 
Skewed distribution curves, 191-192 
Slope bias, 179-182 
Social Climate Scales, 476-477, 568 
Social desirability, self-report personality inven- 
tories and, 388—390 
Social Skills Rating System (SSRS), 538, 568 
Society for Industrial and Organizational 
Psychology (SIOP), 505, 549, 573 
Society for Personality Assessment, 542n, 573 
Soroka v. Dayton Hudson, 555 
Spatial aptitude tests, 517-518 
Spatial Learning Ability Test, 215 
Spearman-Brown formula, 110, 112n, 115, 
117, 186 
Special aptitude testing, 53-56, 515-519 
Speed tests 
definition, 116 
item analysis of, 207-208 
reliability and, 116-118 
Spiral-omnibus format, 286—287 
Split-half reliability, 109—111 
Standard deviation (SD), 67-68 
Standard error measurement (SEM) 121-125 
Standard error of estimate, 157 
Standard error of score differences, 123-125 
Standardization, test, 20-21 
Standardization sample, 20 
Standardized achievement tests. See Achieve- 
ment tests 
Standard scores, 75-78 
Standards for Educational and Psychological 
Testing (APA) 43-45, 132, 140, 488, 
505, 524, 549, 557, 560 
Stanford Achievement Test, 56, 496, 568 
Stanford-Binet Intelligence Scale, 52, 54, 56, 
250, 526, 568 
administration and scoring, 221-224 
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as a criterion in validating group tests, 137 
evolution of scales, 219-221 
fourth edition (SB-IV) 221—228 
reliability, 225-226 
standardization and norms, 224-225 
validity, 226-228 
for the visually impaired, 270 
Stanford Diagnostic Mathematics Test, 502, 
568 
Stanford Diagnostic Reading Test, 502, 568 
Stanford Writing Assessment Program, 502, 
568 
Stanine scale, 77 
States versus traits, 398 
State-Trait Anger Expression Inventory 
(STAXI), 398, 568 
State-Trait Anxiety Inventory (STAI), 398, 568 
State-Trait Anxiety Inventory for Children 
(STAIC), 398, 568 
Statistical concepts, 63-68 
Statistical significance, 102-104, 123-125 
Stereotype accuracy, 451-452, 464 
Stereotype vulnerability, 357 
Strong Interest Inventory (SII), 568 
from T317, 404-408 
origins and development of, 403—404 
psychometric evaluation, 409-410 
scoring and interpretation, 408—409 
Strong Vocational Interest Blank (SVIB), 403 
Structural equation modeling, 146-148 
Structure of Intellect Learning Abilities Test, 
328, 568 
Structure-of intellect model, 328-330 
Student Adaptation to College Questionnaire 
(SACQ), 546, 568 , 
Student Self-Concept Scale (SSCS), 470-471, 
568 
Student Styles Questionnaire, 463, 568 . 
Study of Values, 401n 
Subgroup/specific norms, 85 
Suppressor variables, 172-173 
Symptom Checklist-90-Revised (SCL-90-R), 
363-364, 568 
Synthetic validation, 509 
System for Interactive Guidance Information 
(SIGI), 89, 568 
SIGI Plus, 540 


T score, 77 

Taking the SAT (College Board), 39 
Task decomposition, 149 

Taylor-Russell tables, 160—162 
Teacher-made classroom tests, 498-499 
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Teamwork-KSA Test, 519, 568 
Technical Recommendations for Psychological Tests 
and Diagnostic Techniques (APA), 140 
TEMAS (Tell-Me-A-Story), 436—437, 568 
Terra Nova series, 497n, 568 
Test(s). See also under name of 
anxiety, 34-36, 397, 494 
classification and use of, 174-178 
communication, 26 
environment, 28-29 
equating, 88 
familiarization materials, 31, 39-40 
information curve, 123 
methods of introducing, 29-31 
sophistication, 38—40 
sources of information about, 41-43 
specifications, 129, 498 
Test Anxiety Inventory (TAI), 36, 397, 568 
Test Anxiety Profile, 397 
Test batteries 
defined, 170 
multiple regression equation, 171-173 
profile analysis and cutoff scores, 173-174 
Test bias 
differential item functioning, 210-214 
intercept, 182-184 
self-report personality tests and response 
biases, 388-393 
slope, 179-182 
statistical analysis of, 178-185 
Test ceiling, effect on score distribution, 
191-192 
Test Collection Bibliographies (ETS), 42 
Test construction, validity and, 151-152 
Test content, security of, 26-27 
Test examiners 
advanced preparation of, 27-28 
cultural diversity and, 359-360 
need for qualified, 24-25 
projective techniques and, 448 
test scores affected by, 31-33 
Test fairness 
decision models for, 184—185 
use of term, 183 
Test floor, effect on score distribution, 191-192 
Testing the limits, 535 
Test’ of Cognitive Skills (TCS), 294, 297, 299, 
300, 568 
Test of Nonverbal Intelligence (TONI-2), 280 
Test of Variables of Attention (TOVA), 545, 
568 
Test publishers responsibilities of, 552—554 
Test results, communicating, 557—558 
Test-retest reliability, 105-107 


Test scores, factors affecting 
answer sheets, types of, 28-29 
coaching, 37-38 
feedback, 33 
test examiners, 31-33 
training/coaching, 37-41 
Test scores, interpreting 
computers used for, 88-90 
cutoff scores, use of, 94-97 
domain-referenced, 90-94 
ethical issued, 559-564 
norm-referenced, 93-94 
standard error measurement, 121-125 
Tests in Microfiche, 42 
‘Tests in Print, 42 
Tests of Achievement and Proficiency, 496, 568 
Test of Adult Basic Education (TABE), 498, 568 
Test takers, views of, 34-37 
Test-taking attitudes and response biases, 
388-393 
Test user, role of, 25-26 
Test User Qualification Working Group 
(TUQWoG), 25 
Test User Training Working Group 
(TUTWoG), 25, 26 
Thematic Apperception Test (TAT) 433-437, 
568 


Thurstone absolute scaling, 188-191 

Thurstone-type scale, 420 

Time-lag comparisons, 347 

Training/coaching, test scores affected by, 37— 
41 


Trait organization theories, 323 
hierarchical theories, 330-332 
multiple-factor theories, 325-328 
structure-of-intellect model, 328-330 
two-factor theory, 324-325 

Traits 
tole of, 332-336 
self-report personality inventories and 

situational specificity, 393-398 
states versus, 398 

T scores, 77 

Two-Factor Index of Social Position, 283 

Two-factor theory, 324-325 


Uniform Guidelines on Employee Selection Proce- 
dures (1978), 560 
US. Air Force, 77, 175 
US. Army 
Alpha and Beta test, 53, 248, 285 
Aptitude Area, 177 
Selection and Classification, Project, 138, 
175, 510, 515 


US. Civil Service Commission, 57 

US. Department of Defense, 502, 514 

US. Department of Labor, 512-513 

US. Employment Service (USES), 40, 173, 
512, 539—540, 574 

US. Office of Education, 84 

U.S. Office of Personnel Management, 502, 507 

US. Office of California, Services (OSS), 59, 466 

University of California, Institute for Personality 
Assessment and Research, 466, 471-472 

Utility, decision theory and, 165—167 


Validity 
comparison of different types, 150-153 
consequences of testing, 152-153 
construct, 128 
construct-identification procedures, 140-149 
content-description, 128-132 
convergent and discriminant, 143-144 
criterion-prediction, 132-140 
cross-validation, 208-210 
decision theory and, 158-170 
defined, 127 
differential, 175 
Differential Ability Scales, 245-246 
early uses of, 128 
face, 131-132 
factorial, 142-143 
group, 136 
impact of, on individuals and society, 152-153 
incremental, 162 
productivity and, 163, 165 
projective techniques and, 451—453 
of psychological tests, 22-23 
shrinkage, 210 
Stanford-Binet, 226-228 
test construction and, 151-152 
Wechsler Intelligence Scales, 234-235 
Validity coefficient, 22-23 
conditions affecting, 155—157 
error of estimate and, 155-158 
magnitude of, 157-158 
Validity generalization (VG), 138-139, 513 
Values, role of, 400 
Values Scale, 401, 568 
Variability, 66-67 
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reliability and, 119-121 
Variance, 67 
Verbal techniques, 439-441 
Vigil Continuous Performance Test, 545, 568 
Vineland Adpative Behavior Scales, (VABS), 
263-265, 568 
Vineland Social Maturity Scale, 263 
Visually impaired, testing of, 270-271 
Vocational interests. See Career; Interest tests 


Washington University Sentence Completion 
Test (WUSCT), 441n, 468-469 
Wechsler Adult Intelligence Scale (WAIS-R), 

89, 229-230, 305 
Wechsler-Bellevue Intelligence Scale, 229 
Wechsler Individual Achievement Test 
(WIAT), 535, 568 
Wechsler Intelligence Scales, 76, 79, 526 
abbreviated scales/short forms, 231, 233 
age decrement, 345-347 
description of scales, 230-231 
development of, 228-230 
norms and scoring, 233 
reliability, 233-234 
validity, 234-235 
for the visually impaired, 270 
Wechsler Intelligence Scale for Children 
(WISC-R, WISC-III), 89, 229, 230, 568 
Performance Scale for Deaf Children, 268-269 
Wechsler Preschool and Primary Scale of Intel- 
ligence (WPPSI, WPPSI-R), 230, 250, 
568 
Within-group norms, 72-80 
Wonderlic Personnel Test, 511, 568 
Woodcock-Johnson Psycho-Education 
Battery-Revised, 253n, 534, 568 
Woodcock Reading Mastery Tests-Revised, 
246, 568 
Woodworth Personal Data Sheet, 58, 363, 
364-365 
Word association test, 439-440 
Work Keys system, 508, 568 
Work roles, 411 
Work styles, 411 


zscores, 75 
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